Cell Type Classification Pipeline
Recommended Post : 【Bioinformatics】 Table of Contents for Bioinformatics Analysis
1. Matrix and Sparse Matrix
2. Dimension Reduction with PCA
3. Clustering
4. Exploring DEGs (Differentially Expressed Genes)
5. Package Introduction
a. Determining Cell Types with Seurat
b. Determining Cell Types with scater
c. Determining Cell Types with scanpy
1. Matrix and Sparse Matrix
⑴ A sparse matrix representation stores only the non-zero values of a matrix together with their positions.
⑵ Expression matrices in bioinformatics are large and mostly zeros, so they are stored as sparse matrices to save memory.
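To make the idea of "values and positions" concrete, here is a minimal sketch using only NumPy. The toy matrix and the helper `to_dense` are made up for illustration; this is a hand-rolled coordinate (COO) layout, not the storage format of any particular package.

```python
import numpy as np

# Toy count matrix: 4 cells (rows) x 5 genes (columns), mostly zeros.
dense = np.array([
    [0, 0, 3, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [2, 0, 0, 0, 5],
])

# Coordinate (COO) representation: keep only the non-zero values
# and the (row, column) position of each one.
rows, cols = np.nonzero(dense)
values = dense[rows, cols]

def to_dense(rows, cols, values, shape):
    """Rebuild the dense matrix from the (row, col, value) triplets."""
    out = np.zeros(shape, dtype=values.dtype)
    out[rows, cols] = values
    return out

restored = to_dense(rows, cols, values, dense.shape)
print(list(zip(rows.tolist(), cols.tolist(), values.tolist())))
print("stored entries:", values.size, "of", dense.size)
```

In real analyses a library type such as `scipy.sparse.csr_matrix` would normally be used instead of hand-written triplets; the sketch only shows why storing non-zero entries is cheaper when most entries are zero.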
2. Dimension Reduction with PCA
⑴ Running PCA directly on a large dataset is computationally expensive, so some optimization is necessary.
⑵ Optimization Strategy 1: Use a subset of genes with the highest variability out of around 30,000 genes.
⑶ Optimization Strategy 2: Keep only as many principal components as are needed to explain over 95% of the input variance.
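The two strategies can be sketched with NumPy alone. Everything here is an assumption made for illustration: the data are synthetic, the cutoff `n_top = 50` is arbitrary, and Strategy 2 is read as choosing the number of principal components that reaches 95% cumulative explained variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 200, 500
# Synthetic stand-in for a normalized expression matrix (cells x genes).
X = rng.normal(size=(n_cells, n_genes))
# Make the first 50 genes much more variable than the rest.
X[:, :50] *= 5.0

# Strategy 1: keep only the most variable genes.
n_top = 50  # hypothetical cutoff, chosen for this toy example
gene_var = X.var(axis=0)
top_genes = np.argsort(gene_var)[::-1][:n_top]
X_hvg = X[:, top_genes]

# PCA via SVD of the column-centered matrix.
Xc = X_hvg - X_hvg.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)

# Strategy 2: smallest number of components whose cumulative
# explained variance reaches 95%.
cum = np.cumsum(explained)
n_comp = int(np.searchsorted(cum, 0.95) + 1)
pcs = U[:, :n_comp] * S[:n_comp]  # cell coordinates in PC space

print("genes kept:", X_hvg.shape[1], "components kept:", n_comp)
```

Selecting variable genes first shrinks the matrix that the SVD has to process, which is where most of the cost lies.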
3. Clustering
⑴ Clustering is performed on the principal components obtained from PCA (for example, the top 10).
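⑵ Clustering algorithms such as k-means or graph-based methods (e.g., Louvain) are used. t-SNE is not a clustering algorithm itself; it is a non-linear dimension reduction method commonly used to visualize the resulting clusters in two dimensions.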
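As a rough sketch of the clustering step, the code below runs plain k-means on a matrix of principal-component scores. k-means is swapped in here only because it is short to write; it is not necessarily the algorithm a given package uses. The function `kmeans` and the synthetic 10-dimensional scores are hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct, randomly chosen cells.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each cell to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(1)
# Synthetic PC scores: 2 groups of 50 cells in 10-dimensional PC space.
group_a = rng.normal(loc=0.0, scale=1.0, size=(50, 10))
group_b = rng.normal(loc=8.0, scale=1.0, size=(50, 10))
pcs = np.vstack([group_a, group_b])

labels, centers = kmeans(pcs, k=2)
print("cluster sizes:", np.bincount(labels))
```

The number of clusters `k` has to be supplied by hand in k-means, which is one reason graph-based methods are often preferred for single-cell data.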
4. Exploring DEGs (Differentially Expressed Genes)
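⑴ DEG: A gene whose expression differs significantly between one cluster and the others; here, mainly genes expressed more highly in one cluster than in the rest.
⑵ Statistical tests such as the t-test or generalized linear models (GLMs) are used to find them.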
⑶ Find the biological meaning of the genes that represent each cluster and assign a cell type to each cluster. This step relies on prior biological knowledge and is hard to automate fully.
⑷ Representative genes with biological significance are called marker genes.
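The t-test approach to finding DEGs can be sketched as follows: for each gene, compare the cells in one cluster against all other cells and rank genes by the test statistic. The helper `welch_t`, the synthetic data, and the planted marker at gene index 7 are all assumptions made for this example.

```python
import numpy as np

def welch_t(expr, in_cluster):
    """Welch t-statistic per gene: cells in the cluster vs. all other cells."""
    a, b = expr[in_cluster], expr[~in_cluster]
    mean_diff = a.mean(axis=0) - b.mean(axis=0)
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return mean_diff / se

rng = np.random.default_rng(2)
n_cells, n_genes = 100, 20
# Synthetic normalized expression values (cells x genes).
expr = rng.normal(loc=1.0, scale=1.0, size=(n_cells, n_genes))
# First 50 cells belong to cluster 0, the last 50 to cluster 1.
labels = np.repeat([0, 1], n_cells // 2)
# Plant a marker: gene 7 is more highly expressed in cluster 1.
expr[labels == 1, 7] += 3.0

t_stats = welch_t(expr, labels == 1)
ranked = np.argsort(t_stats)[::-1]  # most up-regulated in cluster 1 first
print("top candidate gene index:", ranked[0])
```

A real analysis would also convert the statistics to p-values and correct for testing many genes at once; `scipy.stats.ttest_ind` with `equal_var=False` computes the same Welch statistic along with p-values.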
5. Package Introduction
⑴ In R, packages such as Seurat and scater implement the pipeline above.
⑵ In Python, Scanpy implements the pipeline above.
Input: 2019.11.22 13:49