Korean, Edit

Understanding and Execution of RCTD

Recommended Article : 【Bioinformatics】 Bioinformatics Analysis Contents


1. Overview

2. Mathematical Theory

3. Code

4. Results

5. Conclusion



1. Overview

⑴ Limits of existing spatial transcriptomics

① Difficulty in determining the location of specific cell types

Reason 1. Difficulty arises because not just single cells but dozens to hundreds of cells can be placed on spatial transcriptomics spots.

Reason 2. Even if only a few cells are placed on a spot, they are not recognized as individual cells but as a new single cell.

② Drawbacks of existing unsupervised marker-dependent analysis2

Drawback 1. Doesn’t control issues like platform effects, Poisson sampling, over-dispersed counts, etc.

Drawback 2. Performance is not significantly superior.

③ Drawbacks of existing MIA (multimodal intersection analysis)3

Drawback 1. Doesn’t control issues like platform effects, Poisson sampling, over-dispersed counts, etc.

Drawback 2. Not validated in large-scale spatial transcriptomics like Slide-seq.

Drawback 3. Difficult to adjust parameters, leading to bias towards enhancement or depletion.

Drawback 4. Results are sensitive to parameter changes.

Drawbacks 1 and Drawback 2 are points highlighted in the paper, while Drawbacks 3 and Drawback 4 are the author’s perspective as a practitioner.

⑵ RCTD (Robust Cell Type Decomposition)

① Features: Able to distinguish specific cell types in spatial transcriptomics data. Supervised.

Significance 1: Unlike other supervised methods, it can differentiate multiple cell types within a specific spot.

Significance 2: Can control platform effects: Without controlling platform effects in a supervised method, technical variability can hinder successful analysis of biological signals.



2. Mathematical Theory

⑴ Statistical Model

① Essentially follows the Poisson model.

○ Applicable even when UMI count is low, like in Slide-seq.

② i : Each spot. 1, …, I

③ j : Each gene. 1, …, J

④ k : Each cell type. 1, …, K

⑤ Yi,j : Gene expression of gene j at spot i (unit: count)

⑥ λi,j : Random probability variable reflecting platform effect and other natural variability (unit: none)

⑦ Ni : Total transcript count of each spot, referred to as UMI (unique molecular identifier)

⑧ αi : Spot-specific fixed effect

⑨ γj : Gene-specific fixed effect

○ Platform-specific like scRNA-seq, snRNA-seq, Slide-seq, Smart-seq.

○ Assumed to follow a normal distribution with mean 0 and standard deviation σγ.

⑩ μk,j : Mean gene expression of gene j in cell type k (unit: normalized expression)

⑪ βi,k : Weight for cell type k at spot i

Condition 1: βi,k + ··· + βi,K = 1

Condition 2: βi,k ≥ 0

○ Once β values are determined, the model can be completed.

⑫ εi,j : Random effect influencing gene expression of gene j at spot i. Related to gene-specific overdispersion.

○ Assumed to follow a normal distribution with mean 0 and standard deviation σε.

⑵ Supervised Learning

Step 1: Assuming μk,j from the reference, it is assumed that these values will be the same in spatial transcriptomics.

○ The average expression of gene j for all single cells of cell type k in the reference.

Step 2: Exclude genes that do not show significant differences between cell types ( to improve analysis performance).

○ Approximately 5,000 genes out of 30,000 genes remain.

Step 3: Interestingly, γj can be determined by expanding the following formula.

Sub-step 1: Sj ≡ ∑(i = 1 to I) Yi,j

Sub-step 2: (First equality) Using E(Y) = λ for Y ~ Poisson(λ) (Poisson distribution)

○ The sum of sample means for Yi,j with respect to i also follows a Poisson distribution. The following is the proof.

○ Property of moment-generating functions: ψX1+X2(t) = ψX1(t) × ψX2(t)

Assumption: X1 ~ Poisson(λ1), X2 ~ Poisson(λ2)

○ ψX1(t) = exp(λ1et - λ1), ψX2(t) = exp(λ2et - λ2)

○ ΨX1+X2(t) = exp((λ1+λ2)et - (λ1+λ2))

Conclusion: X1 + X2 ~ Poisson(λ1 + λ2)

Sub-step 3: (Second equality)

3-1. Substitute λi,j

3-2. Separate the γj term: Easily separable since it is independent of i.

3-3. Change the order of ∑ (i = 1 to I) and ∑ (k = 1 to K)

Sub-step 4: If I is sufficiently large, Var(Bk,j) converges to 0 (supplementary method), making Bk,j ≒ βk.

Sub-step 5: Estimate β0, βk, σγ through maximum likelihood estimation for Sj.

Sub-step 6: Randomly assign γj using σγ.

Step 4: (RCTD)

○ Can estimate αi, βi,k, σε for Yi,j through maximum likelihood estimation when varying gene j within a specific spot.

⑶ Optimized Supervised Learning

① Assume a maximum of 2 overlapping cell types per spot: i.e., spot is a singlet or doublet.

② Of course, you can limit the maximum number of cell types per spot or even not set any limitations at all.

③ Optimization prevents overfitting: Singlets have only one cell type per spot, so **doublets are desirable



3. Code: Implemented in R

Install R

https://github.com/dmcable/RCTD

https://raw.githack.com/dmcable/RCTD/dev/vignettes/spatial-transcriptomics.html

⑷ Required files

Figure. 1. meta_data.csv

Figure. 2. BeadLocationsForR.csv

Figure. 3. MappedDGEForR.csv



4. Results

⑴ Challenge

① Reconfirmation of drawbacks in existing unsupervised marker-dependent analysis

Drawback 1: Unclear distinction of marker genes when colocalizing, like Bergmann cells and Purkinje cells : Unsupervised learning recognizes them as a new single cell when they colocalize (Fig. 1a)

Drawback 2: Misclassification of granule cells during granule cell classification (Fig. 1b, c)

② Reconfirmation of drawbacks in existing supervised analysis

○ Drawback: Cross-platform cell type classification performance is not significantly excellent (Fig. 1d, e)

③ RCTD can resolve these issues

○ RCTD can effectively remove platform effects (Fig. 2b)

○ RCTD shows high cross-platform single-cell classification accuracy (Fig. 2c)

○ RCTD captures Bergmann cells well even in Bergmann-Purkinje doublet situations (Fig. 3c)

○ If a reference cell type does not exist, it is often misclassified as a wrong cell type (supplementary Fig. 11)

⑵ Mouse Cerebellum Sample

① Structure of mouse cerebellum

Figure. 4. Structure of mouse cerebellum

② snRNA-seq reference × Slide-seqV2

○ The resulting cell types accurately reflect the spatial architecture of the cerebellum (Fig. 4a)

○ RCTD, set in singlet and doublet modes, correctly distinguishes Purkinje cells and Bergmann cells (Fig. 4b, 4c)

○ Comparison with actual markers shows excellent correspondence (Fig. 4d)

③ scRNA-seq reference × Slide-seqV2

○ 95.7% agreement between cell type prediction in snRNA-seq reference × Slide-seqV2

○ Other analyses demonstrate RCTD’s consistency as well

⑶ Hippocampus Sample

① Co-localization of RCTD-based interneuron cell types and interneuron markers (Fig. 5b)

② Co-localization of sub-clusters within RCTD-based interneuron markers and Sst expression (Fig. 5c, 5d)

③ RCTD can demonstrate the influence of cellular environment on gene expression (Fig. 6g)



5. Conclusion

Limit 1: Assumes that platform effects are common to all cell types : Investigating cell-type-specific platform effects may be necessary.

Limit 2: Cell types present in spatial data but absent in the reference could be problematic.

⑶ Question : Unclear if the unit of μk,j is normalized expression.

⑷ Application : Could be extended to spatial transcriptomics × spatial transcriptomics analysis.



Input: 2021.06.04 00:34

results matching ""

    No results matching ""