Understanding and Execution of RCTD
Recommended Article : 【Bioinformatics】 Bioinformatics Analysis Contents
1. Overview
3. Code
4. Results
5. Conclusion
1. Overview
⑴ Limits of existing spatial transcriptomics
① Difficulty in determining the location of specific cell types
○ Reason 1. Difficulty arises because not just single cells but dozens to hundreds of cells can be placed on spatial transcriptomics spots.
○ Reason 2. Even if only a few cells are placed on a spot, they are not recognized as individual cells but as a new single cell.
② Drawbacks of existing unsupervised marker-dependent analysis2
○ Drawback 1. Doesn’t control issues like platform effects, Poisson sampling, over-dispersed counts, etc.
○ Drawback 2. Performance is not significantly superior.
③ Drawbacks of existing MIA (multimodal intersection analysis)3
○ Drawback 1. Doesn’t control issues like platform effects, Poisson sampling, over-dispersed counts, etc.
○ Drawback 2. Not validated in large-scale spatial transcriptomics like Slide-seq.
○ Drawback 3. Difficult to adjust parameters, leading to bias towards enhancement or depletion.
○ Drawback 4. Results are sensitive to parameter changes.
○ Drawbacks 1 and Drawback 2 are points highlighted in the paper, while Drawbacks 3 and Drawback 4 are the author’s perspective as a practitioner.
⑵ RCTD (Robust Cell Type Decomposition)
① Features: Able to distinguish specific cell types in spatial transcriptomics data. Supervised.
② Significance 1: Unlike other supervised methods, it can differentiate multiple cell types within a specific spot.
③ Significance 2: Can control platform effects: Without controlling platform effects in a supervised method, technical variability can hinder successful analysis of biological signals.
2. Mathematical Theory
⑴ Statistical Model
① Essentially follows the Poisson model.
○ Applicable even when UMI count is low, like in Slide-seq.
② i : Each spot. 1, …, I
③ j : Each gene. 1, …, J
④ k : Each cell type. 1, …, K
⑤ Yi,j : Gene expression of gene j at spot i (unit: count)
⑥ λi,j : Random probability variable reflecting platform effect and other natural variability (unit: none)
⑦ Ni : Total transcript count of each spot, referred to as UMI (unique molecular identifier)
⑧ αi : Spot-specific fixed effect
⑨ γj : Gene-specific fixed effect
○ Platform-specific like scRNA-seq, snRNA-seq, Slide-seq, Smart-seq.
○ Assumed to follow a normal distribution with mean 0 and standard deviation σγ.
⑩ μk,j : Mean gene expression of gene j in cell type k (unit: normalized expression)
⑪ βi,k : Weight for cell type k at spot i
○ Condition 1: βi,k + ··· + βi,K = 1
○ Condition 2: βi,k ≥ 0
○ Once β values are determined, the model can be completed.
⑫ εi,j : Random effect influencing gene expression of gene j at spot i. Related to gene-specific overdispersion.
○ Assumed to follow a normal distribution with mean 0 and standard deviation σε.
⑵ Supervised Learning
① Step 1: Assuming μk,j from the reference, it is assumed that these values will be the same in spatial transcriptomics.
○ The average expression of gene j for all single cells of cell type k in the reference.
② Step 2: Exclude genes that do not show significant differences between cell types (∵ to improve analysis performance).
○ Approximately 5,000 genes out of 30,000 genes remain.
③ Step 3: Interestingly, γj can be determined by expanding the following formula.
○ Sub-step 1: Sj ≡ ∑(i = 1 to I) Yi,j
○ Sub-step 2: (First equality) Using E(Y) = λ for Y ~ Poisson(λ) (Poisson distribution)
○ The sum of sample means for Yi,j with respect to i also follows a Poisson distribution. The following is the proof.
○ Property of moment-generating functions: ψX1+X2(t) = ψX1(t) × ψX2(t)
○ Assumption: X1 ~ Poisson(λ1), X2 ~ Poisson(λ2)
○ ψX1(t) = exp(λ1et - λ1), ψX2(t) = exp(λ2et - λ2)
○ ΨX1+X2(t) = exp((λ1+λ2)et - (λ1+λ2))
○ Conclusion: X1 + X2 ~ Poisson(λ1 + λ2)
○ Sub-step 3: (Second equality)
○ 3-1. Substitute λi,j
○ 3-2. Separate the γj term: Easily separable since it is independent of i.
○ 3-3. Change the order of ∑ (i = 1 to I) and ∑ (k = 1 to K)
○ Sub-step 4: If I is sufficiently large, Var(Bk,j) converges to 0 (supplementary method), making Bk,j ≒ βk.
○ Sub-step 5: Estimate β0, βk, σγ through maximum likelihood estimation for Sj.
○ Sub-step 6: Randomly assign γj using σγ.
④ Step 4: (RCTD)
○ Can estimate αi, βi,k, σε for Yi,j through maximum likelihood estimation when varying gene j within a specific spot.
⑶ Optimized Supervised Learning
① Assume a maximum of 2 overlapping cell types per spot: i.e., spot is a singlet or doublet.
② Of course, you can limit the maximum number of cell types per spot or even not set any limitations at all.
③ Optimization prevents overfitting: Singlets have only one cell type per spot, so **doublets are desirable
3. Code: Implemented in R
⑵ https://github.com/dmcable/RCTD
⑶ https://raw.githack.com/dmcable/RCTD/dev/vignettes/spatial-transcriptomics.html
⑷ Required files
Figure. 1. meta_data.csv
Figure. 2. BeadLocationsForR.csv
Figure. 3. MappedDGEForR.csv
4. Results
⑴ Challenge
① Reconfirmation of drawbacks in existing unsupervised marker-dependent analysis
○ Drawback 1: Unclear distinction of marker genes when colocalizing, like Bergmann cells and Purkinje cells : Unsupervised learning recognizes them as a new single cell when they colocalize (Fig. 1a)
○ Drawback 2: Misclassification of granule cells during granule cell classification (Fig. 1b, c)
② Reconfirmation of drawbacks in existing supervised analysis
○ Drawback: Cross-platform cell type classification performance is not significantly excellent (Fig. 1d, e)
③ RCTD can resolve these issues
○ RCTD can effectively remove platform effects (Fig. 2b)
○ RCTD shows high cross-platform single-cell classification accuracy (Fig. 2c)
○ RCTD captures Bergmann cells well even in Bergmann-Purkinje doublet situations (Fig. 3c)
○ If a reference cell type does not exist, it is often misclassified as a wrong cell type (supplementary Fig. 11)
⑵ Mouse Cerebellum Sample
① Structure of mouse cerebellum
Figure. 4. Structure of mouse cerebellum
② snRNA-seq reference × Slide-seqV2
○ The resulting cell types accurately reflect the spatial architecture of the cerebellum (Fig. 4a)
○ RCTD, set in singlet and doublet modes, correctly distinguishes Purkinje cells and Bergmann cells (Fig. 4b, 4c)
○ Comparison with actual markers shows excellent correspondence (Fig. 4d)
③ scRNA-seq reference × Slide-seqV2
○ 95.7% agreement between cell type prediction in snRNA-seq reference × Slide-seqV2
○ Other analyses demonstrate RCTD’s consistency as well
⑶ Hippocampus Sample
① Co-localization of RCTD-based interneuron cell types and interneuron markers (Fig. 5b)
② Co-localization of sub-clusters within RCTD-based interneuron markers and Sst expression (Fig. 5c, 5d)
③ RCTD can demonstrate the influence of cellular environment on gene expression (Fig. 6g)
5. Conclusion
⑴ Limit 1: Assumes that platform effects are common to all cell types : Investigating cell-type-specific platform effects may be necessary.
⑵ Limit 2: Cell types present in spatial data but absent in the reference could be problematic.
⑶ Question : Unclear if the unit of μk,j is normalized expression.
⑷ Application : Could be extended to spatial transcriptomics × spatial transcriptomics analysis.
Input: 2021.06.04 00:34