Cell Type Classification Pipeline

1. Matrix and Sparse Matrix

⑴ A sparse matrix stores only the values and positions of the non-zero entries of a matrix.

⑵ Bioinformatics datasets are large and, as with gene expression counts, mostly zeros, so they are stored as sparse matrices.
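
The idea can be illustrated with a minimal sketch in plain Python that keeps only (row, column, value) triplets, a layout often called coordinate (COO) format. The toy matrix and the helper names `to_coo` and `to_dense` are assumptions made for illustration and do not come from any particular library.

```python
# Toy expression matrix: rows are cells, columns are genes (values are made up).
dense = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [5, 0, 0, 1],
]


def to_coo(matrix):
    """Return (row, column, value) triplets for the non-zero entries only."""
    return [
        (i, j, value)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    ]


def to_dense(triplets, n_rows, n_cols):
    """Rebuild the full matrix from the triplets (hypothetical helper)."""
    matrix = [[0] * n_cols for _ in range(n_rows)]
    for i, j, value in triplets:
        matrix[i][j] = value
    return matrix


coo = to_coo(dense)
print(coo)  # [(0, 2, 3), (2, 0, 5), (2, 3, 1)]
print(len(coo), "stored entries instead of", len(dense) * len(dense[0]))
```

In practice a library such as SciPy (`scipy.sparse`) would be used instead of hand-written triplets. The saving grows with the fraction of zeros in the matrix.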

2. PCA

⑴ Applying PCA directly to a large dataset is impractical, so optimization is necessary.

⑵ Optimization Strategy 1: Keep only the subset of genes with the highest variability, out of around 30,000 genes.

⑶ Optimization Strategy 2: Choose the number of genes so that the selected genes explain over 95% of the input variance.
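
The two strategies can be sketched in plain Python as follows. This is a simplified illustration that ranks genes by raw variance. The toy matrix and the function names are assumptions, and real pipelines usually rank genes by a mean-adjusted dispersion rather than raw variance.

```python
from statistics import pvariance

# Toy expression matrix: rows are cells, columns are genes (values are made up).
expression = [
    [1.0, 5.0, 0.0, 2.0],
    [1.0, 9.0, 0.0, 2.5],
    [1.0, 1.0, 0.5, 3.0],
    [1.0, 5.0, 0.0, 2.5],
]


def gene_variances(matrix):
    """Variance of each gene (column) across all cells."""
    return [pvariance(column) for column in zip(*matrix)]


def rank_by_variance(variances):
    """Gene indices ordered from most to least variable."""
    return sorted(range(len(variances)), key=lambda g: variances[g], reverse=True)


def top_variable_genes(variances, k):
    """Strategy 1: keep the k most variable genes."""
    return rank_by_variance(variances)[:k]


def genes_for_variance_fraction(variances, fraction=0.95):
    """Strategy 2: keep the fewest top genes whose variance exceeds `fraction` of the total."""
    total = sum(variances)
    selected, running = [], 0.0
    for g in rank_by_variance(variances):
        selected.append(g)
        running += variances[g]
        if running > fraction * total:
            break
    return selected


variances = gene_variances(expression)
print(top_variable_genes(variances, k=2))      # [1, 3]
print(genes_for_variance_fraction(variances))  # [1]
```

In this toy data the second gene carries almost all of the variance, so the 95% rule keeps it alone. PCA would then be run on the reduced matrix.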

3. Clustering

⑴ Clustering is performed on the 10 principal components obtained from PCA.

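⑵ t-SNE is commonly used alongside clustering, but it is an embedding method for visualizing cells in two dimensions rather than a clustering algorithm. The clusters themselves come from algorithms such as k-means or graph-based community detection (Louvain, Leiden).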
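
As a rough illustration of clustering cells in principal-component space, here is a minimal k-means sketch (Lloyd's algorithm) in plain Python. k-means is used only because it is short to write; the notes above do not say which algorithm the pipeline uses. The points are made-up two-dimensional coordinates rather than 10 real components, and the initial centroids are supplied by hand to keep the result deterministic.

```python
from math import dist


def kmeans(points, centroids, n_iter=10):
    """Lloyd's algorithm with caller-supplied initial centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each cell goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster where it was
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Made-up coordinates of six cells on two principal components.
cells = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(cells, centroids=[cells[0], cells[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The resulting labels are what the later steps treat as clusters. A two-dimensional embedding of the same cells would be used only for plotting them.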

4. DEG Analysis

⑴ DEG (differentially expressed gene): a gene that is expressed more in one cluster than in the other clusters.

⑵ Methods such as the t-test or a generalized linear model (GLM) are used.

⑶ Find the biological meaning of the genes that represent each cluster and assign a cell type to each cluster (this step relies on expert judgment and cannot be fully automated).

⑷ Representative genes with biological significance are called marker genes.
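
The t-test approach can be sketched as below: for each gene, the cells in one cluster are compared against all other cells using Welch's t-statistic, and genes are ranked by that score. The expression values, cluster labels, and function names are made up for illustration. A real analysis would also compute p-values and correct them for multiple testing.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t-statistic; assumes at least one group has non-zero variance."""
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


def rank_genes(expression, labels, cluster):
    """Rank genes by t-statistic: cells in `cluster` versus all other cells."""
    scores = {}
    for gene, values in expression.items():
        inside = [v for v, label in zip(values, labels) if label == cluster]
        outside = [v for v, label in zip(values, labels) if label != cluster]
        scores[gene] = welch_t(inside, outside)
    return sorted(scores, key=scores.get, reverse=True)


# Made-up cluster labels and per-gene expression for six cells.
labels = [0, 0, 0, 1, 1, 1]
expression = {
    "geneA": [5.0, 6.0, 5.5, 1.0, 0.5, 1.5],  # high in cluster 0
    "geneB": [1.0, 1.2, 0.8, 1.1, 0.9, 1.0],  # similar everywhere
    "geneC": [0.2, 0.0, 0.4, 4.0, 4.5, 3.5],  # high in cluster 1
}

print(rank_genes(expression, labels, cluster=0)[0])  # geneA
print(rank_genes(expression, labels, cluster=1)[0])  # geneC
```

The top-ranked genes per cluster are candidates only. Deciding which of them are meaningful marker genes is the manual annotation step described above.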

5. Software

⑴ In R, packages such as Seurat and scater implement the above pipeline.

⑵ In Python, Scanpy implements the above pipeline.
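
For orientation, the steps above map roughly onto Scanpy calls as in the sketch below. The function name `cluster_and_rank`, the parameter defaults, and the choice of Leiden clustering are assumptions for illustration, not a prescribed workflow. Scanpy is imported inside the function so that the definition loads even where Scanpy is not installed, and the Leiden step needs the optional `leidenalg` dependency.

```python
def cluster_and_rank(adata, n_top_genes=2000, n_pcs=10):
    """Hypothetical wrapper: raw-count AnnData in, clustered and ranked AnnData out."""
    import scanpy as sc  # lazy import; requires scanpy to be installed

    # Normalize counts per cell and log-transform.
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)

    # Keep the most variable genes before PCA.
    sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes)
    adata = adata[:, adata.var["highly_variable"]].copy()
    sc.pp.scale(adata, max_value=10)

    # PCA, then graph-based clustering on the principal components.
    sc.tl.pca(adata, n_comps=n_pcs)
    sc.pp.neighbors(adata, n_pcs=n_pcs)
    sc.tl.leiden(adata, key_added="cluster")

    # t-SNE embedding for visualization only.
    sc.tl.tsne(adata, n_pcs=n_pcs)

    # Rank genes per cluster with a t-test to find marker candidates.
    sc.tl.rank_genes_groups(adata, groupby="cluster", method="t-test")
    return adata
```

The input would typically be loaded with a reader such as `sc.read_10x_mtx` or `sc.read_h5ad`. After the call, the ranked genes are stored under `adata.uns["rank_genes_groups"]`, and assigning cell types from them remains a manual step.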

Input: 2019.11.22 13:49