Determining Cell Types with Seurat
Recommended article : 【Bioinformatics】 Bioinformatics Analysis Table of Contents
a. Cell Type Classification Pipeline
b. Determining Cell Types with scater
c. Determining Cell Types with scanpy
※ This document is from before the recent upgrade to Seurat v5.0.
※ Seurat is named after Georges Seurat.
1. Official Tutorial (R STUDIO)
library(dplyr)
install.packages("Seurat")
library(Seurat)
### import input_data(cellranger count output과 동일) ###
pbmc.data <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")
### setup the seurat object ###
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
### Preprocessing(QC) ###
### Low quality cell, mitochondria genome percent check
pbmc[["percent.mt"]] <- PercentageFeatureSet(object = pbmc, pattern = "^MT-")
### Quality Control
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 15)
### normalizing the data ###
pbmc <- NormalizeData(object = pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
### identification of highly variable features ###
pbmc <- FindVariableFeatures(object = pbmc, selection.method = "vst", nfeatures = 2000)
### scaling data ###
all.genes <- rownames(x = pbmc)
pbmc <- ScaleData(object = pbmc, features = all.genes)
### perform linear dimensional reduction ###
pbmc <- RunPCA(object = pbmc, features = VariableFeatures(object = pbmc))
### determine the "dimensionality" of the dataset
pbmc <- JackStraw(object = pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(object = pbmc, dims = 1:20)
### cluster the cells ###
pbmc <- FindNeighbors(object = pbmc, dims = 1:10)
pbmc <- FindClusters(object = pbmc, resolution = 0.5)
### Run non-linear dimensional reduction(tSNE) ###
pbmc <- RunUMAP(object = pbmc, dims = 1:10)
DimPlot(object = pbmc, reduction = "umap")
pbmc <- RunTSNE(object = pbmc, dims = 1:10)
DimPlot(object = pbmc, reduction = "tsne")
### Finding differentially expressed features(biomarkers) ###
markers <- FindAllMarkers(object = pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
### Maybe you need this kind of data ###
umap <- DimPlot(object = pbmc, reduction = "umap")
tsne <- DimPlot(object = pbmc, reduction = "tsne")
names(umap)
# [1] "data" "layers" "scales" "mapping" "theme"
# [6] "coordinates" "facet" "plot_env" "labels" "guides"
write.csv(markers, "markers.csv")
write.csv(umap$data, "umap_data.csv")
write.csv(tsne$data, "tsne_data.csv")
Figure. 1. Clusters based on cell types represented by DimPlot
⑴ Example Files
⑵ scRNA-seq data represents read count divided by total transcript and multiplied by 10,000
⑶ CreateSeuratObject
① min.features : A smaller value results in fewer missing data
⑷ The top 2,000 genes with the highest expression are selected using the nfeatures parameter in the FindVariableFeatures function
① nfeatures : A larger value enhances clustering. Generally, having more than 2,000 features doesn’t significantly impact clustering
⑸ NormalizeData
① scRNA-seq data is scaled using log-normalization
② Subsequently, the data is scaled to z-scores through regression of total cellular read count and mitochondrial read count
⑹ ScaleData
① pbmc@assays$RNA@scale.data is created from pbmc@assays$RNA@data
② Operation 1: Adjust the zero point of normalized expression stored in pbmc@assays$RNA@data
○ The minimum value of pbmc@assays$RNA@scale.data is usually negative
③ Operation 2: Values exceeding 10 after scaling are set to 10
○ Without this assignment operation, the Spearman correlation between pbmc@assays$RNA@data and pbmc@assays$RNA@scale.data would be 1: preserving rank
④ True gene expression is better observed through pbmc@assays$RNA@scale.data
○ The 0-10 scale commonly used in bulk-RNA seq such as TCGA (The Cancer Genome Atlas) is inferred to be based on scale.data
⑺ RunPCA
① PCA (principal component analysis) is performed for dimensionality reduction: to enhance clustering efficiency
② 50 principal components are utilized for clustering
⑻ JackStraw
① Parameters specified on the reference guide can be omitted if given
② num.replicate : Having a value of 1 doesn’t affect subsequent steps
⑼ ScoreJackStraw
① Parameters specified on the reference guide can be omitted if given
⑽ Clustering is performed through FindClusters, usually with a resolution of 0.5
① resolution : A larger value categorizes cell types more diversely
② FindClusters employs a graph-based approach
⑾ RunUMAP
① dims : Can slightly alter the shape of the map
⑿ Wilcoxon rank sum test is used in FindAllMarkers
① A marker gene requires expression in over 25% of cells within a cluster and a fold change greater than 0.25 (log scale)
⒀ DotPlot
① Link
② Although not shown in the code above, this is frequently used
⒁ FeaturePlot
① Link
② Although not shown in the code above, this is frequently used
2. Customization
⑴ Overview: If you want to analyze your own data with Seurat
⑵ In the filtered_gene_bc_matrices/hg19/ directory, there are three files
①
barcodes.tsv
: Barcodes for each single cell
②
genes.tsv
: Official names and aliases of human genes
③
matrix.mtx
: Sparse matrix representing gene expression for each barcode and gene
⑶ 1st. Modifying matrix.mtx
① 1st - The matrix with only 1st - 1st data will be given (with rownames and colnames)
② 1st - 2nd - Assign the first value as the position of the non-zero gene indicated by the data in matrix.mtx
③ 1st - 3rd - Assign the second value as the barcode of the single cell indicated by the non-zero data
④ 1st - 4th - Assign the third value as the expression level indicated by the non-zero data
⑤ 1st - 5th - Combine the first, second, and third values into one row to construct the entire matrix (sparse matrix structure)
⑥ 1st - 6th - Save the entire matrix as a text file and convert it to a .mtx file
⑦ 1st - 7th - Replace the original matrix.mtx file and add the first three lines as follows
Figure. 2. Format of matrix.mtx
%%MatrixMarket matrix coordinate real general % 32738 1047 2572919
○ 32,738 : Number of human genes, no need to change
○ 1,047 is the number of barcodes, should be adjusted according to the experiment
○ 2,572,919 is the number of non-zero data points, should be adjusted according to the experiment
⑧ Reading
matrix.mtx.gz
is recommended over matrix.mtx
⑷ 2nd. Modifying barcodes.tsv : Assign barcodes randomly to the number of single cells in the experimental data (without overlap)
① If the barcode order in matrix.mtx exceeds the range of barcode count in barcodes.tsv, the following error message is displayed:
readMM(): column values ‘j’ are not in 1:nc
② Reading
barcodes.tsv.gz
is recommended overbarcodes.tsv
⑸ 3rd. No need to modify genes.tsv
① As Seurat versions update, genes.tsv should be renamed to
features.tsv
② Reading
genes.tsv.gz
is recommended overgenes.tsv
3. Trouble-shooting
⑴ “File is too large to open in Notepad.”
① Not possible to add content when text file starts approaching 1 GB
② WordPad is slow and lacks desired features
③ Use programs like gVim 8.0 (Windows) or sublime txt (Mac) to edit large text files
⑵ “Directory provided does not exist”
pbmc.data <- Read10X(data.dir = "../data/pbmc3k/filtered_gene_bc_matrices/hg19/")
# Error in Read10X(data.dir = "../data/pbmc3k/filtered_gene_bc_matrices/hg19/") :
# Directory provided does not exist
① Directly unzip files and change the directory address of the generated document
② Files are available on various sites
③ Modify the code as follows
pbmc.data <- Read10X(data.dir = "C:/Users/sun/Desktop/filtered_gene_bc_matrices/hg19/")
⑶ Barcode file missing. Expecting barcodes.tsv.gz
① Recent Seurat pipelines require
barcodes.tsv.gz
,features.tsv.gz
,matrix.mtx.gz
instead ofbarcodes.tsv
,genes.tsv
,matrix.mtx
② If dealing with datasets tailored to older Seurat versions, use ReadMtx instead of Read10X
⑷ “Error: package or namespace load failed for ‘Seurat’ in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()) versionCheck = vI[[j]]): No package called ‘multtest’”
install.packages("BiocManager")
BiocManager::install('multtest')
install.packages("Seurat")
library(Seurat)
⑸ “Error: cannot allocate vector of size 13.3 GB”
① Cause: Process exceeding memory limit
○ Typically demanding memory during
pbmc ← ScaleData(object = pbmc, features = all.genes)
○ Memory is also required in
pbmc ← JackStraw(object = pbmc, num.replicate = 100)
, changing num.replicate to 1 solves it
○ Memory.limit cannot exceed RAM memory
○ Task Manager → Performance → Memory
○ Check total memory capacity
○ Seurat generally uses 7 GB and requires additional memory based on data size (e.g., 13.3 GB)
○ Buying additional RAM can help after checking used slots
○ (Comment) For data analysis, 32 GB RAM seems necessary
② Solution (R 4.1.x version or lower)
memory.size(max = TRUE)
memory.size(max = FALSE)
memory.limit(size = NA)
memory.limit(size = 50000)
③ Solution (R 4.2.x version or higher)
○ “memory.limit() is no longer supported” error
○ Increase R’s maximum memory limit as follows (Reference)
library(usethis)
usethis::edit_r_environ()
# Then, type "R_MAX_VSIZE=100Gb" and save .Renviron file.
○ However, in R 4.2.x and above, memory is automatically increased, so no need to worry (Reference)
⑹ “Error: invalid dimnames given for “dgTMatrix” object in dimnamesGets(x, value)”
① Problem: Incorrectly sorted values in genes.tsv causing issues in Read10X
② Ensuring all tuples are in 3 columns will allow smooth progression to the next step
⑺ An old seurat object 23341 genes across 624 samples
① Issue: Old seurat object is difficult to analyze or observe
② Solution: Use UpdateSeuratObject
load('~/Downloads/droplet_Heart_and_Aorta_seurat_tiss.Robj')
pbmc = UpdateSeuratObject(tiss)
DimPlot(pbmc)
⑻ How to save matrix.mtx
, barcodes.tsv
, and genes.tsv
files from a Seurat object. (R)
library(Seurat)
counts <- seurat_obj@assays$RNA@counts
barcodes <- rownames(seurat_obj@meta.data)
features <- rownames(counts)
library(Matrix)
writeMM(counts, file = "~/Downloads/matrix.mtx")
write.table(barcodes, file = "~/Downloads/barcodes.tsv", row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\t")
gene_ids <- features
gene_names <- features
features_df <- data.frame(gene_ids, gene_names)
write.table(features_df, file = "~/Downloads/features.tsv", row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\t")
⑼ Creating an h5ad file when you have matrix.mtx
, barcodes.tsv
, and features.tsv
(or genes.tsv
) (Python)
import os
import glob
from scipy.io import mmread
import pandas as pd
import scanpy as sc
# Set the target directory here
target_directory = '/path/to/your/data.h5ad'
# Iterate over all .mtx files in the target directory
for mtx_file in glob.glob(os.path.join(target_directory, '*matrix.mtx')):
prefix = mtx_file[:-10] # Remove 'matrix.mtx' to get the prefix
# Construct file names for barcodes and genes
barcodes_file = prefix + 'barcodes.tsv'
features_file = prefix + 'features.tsv'
genes_file = prefix + 'features.tsv' if os.path.exists(prefix + 'features.tsv') else prefix + 'genes.tsv'
# Check if all files exist
if os.path.exists(mtx_file) and os.path.exists(barcodes_file) and os.path.exists(genes_file):
# Load the data
matrix = mmread(mtx_file).T.tocsr() # Transpose to have genes as columns
genes = pd.read_csv(genes_file, header=None, sep='\t')
barcodes = pd.read_csv(barcodes_file, header=None, sep='\t')
# Create an AnnData object
adata = sc.AnnData(X=matrix, obs={'barcodes': barcodes[0].values}, var={'gene_ids': genes[0].values, 'gene_symbols': genes[1].values})
# Output file name
h5ad_file = prefix + '.h5ad'
# Save the AnnData object
adata.write(h5ad_file)
# Delete the original files
os.remove(mtx_file)
os.remove(barcodes_file)
os.remove(genes_file)
print("Conversion complete. Original files have been deleted.")
⑽ When there is a tissue_dir directory containing matrix.mtx.gz
, barcodes.tsv
, features.tsv
, and the spatial folder, the code to read Visium data in Python
import scanpy as sc
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.image import imread
import numpy as np
import cv2
import math
from pathlib import Path, PurePath
from typing import Union, Dict, Optional, Tuple, BinaryIO
import json
def to_spatial(adata, path, library_id = 'TNBC1'):
adata.uns["spatial"][library_id] = dict()
path = Path(path)
files = dict(
tissue_positions_file=path / 'spatial/tissue_positions_list.csv',
scalefactors_json_file=path / 'spatial/scalefactors_json.json',
hires_image=path / 'spatial/tissue_hires_image.png',
lowres_image=path / 'spatial/tissue_lowres_image.png',
)
# check if files exists, continue if images are missing
for f in files.values():
if not f.exists():
if any(x in str(f) for x in ["hires_image", "lowres_image"]):
logg.warning(
f"You seem to be missing an image file.\n"
f"Could not find '{f}'."
)
else:
raise OSError(f"Could not find '{f}'")
adata.uns["spatial"][library_id]['images'] = dict()
for res in ['hires', 'lowres']:
try:
adata.uns["spatial"][library_id]['images'][res] = imread(
str(files[f'{res}_image'])
)
except Exception:
raise OSError(f"Could not find '{res}_image'")
# read json scalefactors
adata.uns["spatial"][library_id]['scalefactors'] = json.loads(
files['scalefactors_json_file'].read_bytes()
)
#adata.uns["spatial"][library_id]["metadata"] = {
# k: (str(attrs[k], "utf-8") if isinstance(attrs[k], bytes) else attrs[k])
# for k in ("chemistry_description", "software_version")
# if k in attrs
#}
# read coordinates
positions = pd.read_csv(files['tissue_positions_file'], header=None)
positions.columns = [
'barcode',
'in_tissue',
'array_row',
'array_col',
'pxl_col_in_fullres',
'pxl_row_in_fullres',
]
positions.index = positions['barcode']
adata.obs = adata.obs.join(positions, how="left")
adata.obsm['spatial'] = adata.obs[
['pxl_row_in_fullres', 'pxl_col_in_fullres']
].to_numpy()
adata.obs.drop(
columns=['barcode', 'pxl_row_in_fullres', 'pxl_col_in_fullres'],
inplace=True,
)
return adata
def load_adata(tissue_dir):
adata1 = sc.read_mtx(tissue_dir + 'matrix.mtx')
adata1 = adata1.transpose()
cellname = pd.read_csv(tissue_dir + 'barcodes.tsv', sep="\t", header=None, index_col=1, names=['idx', 'barcode'])
cellname.index = cellname['idx']
featurename = pd.read_csv(tissue_dir + 'features.tsv', sep='\t', header=None, index_col=1, names=['gene_ids', 'feature_types'])
adata1.var = featurename
adata1.obs = cellname
adata1.uns["spatial"] = dict()
adata1 = to_spatial(adata1, tissue_dir, library_id='Library_1')
adata1 = adata1[adata1.obs.in_tissue == 1,:]
adata1.var_names_make_unique()
adata1.obs['unnormalized_counts'] = adata1.X.sum(axis=1).A1
sc.pp.normalize_total(adata1, inplace=True)
adata1.obs['normalized_counts'] = adata1.X.sum(axis=1).A1
sc.pp.log1p(adata1)
sc.pp.highly_variable_genes(adata1, flavor="seurat", n_top_genes=2000)
return adata1
⑾ In case matrix.mtx.gz
, barcodes.tsv
, or features.tsv
are corrupted:
① If the
matrix.mtx.gz
file is corrupted, read it in the R environment usingreadMM
and restore the corrupted file usingwriteMM
.
② For
barcodes.tsv
andfeatures.tsv
, read them in the R environment usingread.table
and restore the corrupted files usingwrite.table
.
⑿ The same code can produce different or similar results depending on the computer
① Reason: Initial conditions, i.e., seed, may slightly differ.
⒀ Input file is .rds
# when loading .rds file
pbmc <- readRDS("Example.rds")
# when saving .rds file
saveRDS(pbmc, file = "Example.rds")
## You can even plot pbmc, when the .rds file is fully processed file
DimPlot(object = pbmc, reduction = "umap")
①
.rds
file: Stores a single variable in R
⒁ Input file is .RData
or .Rdata
# when loading .RData file
load(file="Example.RData")
# when saving .RData file
## method 1.
save.image(file="Example.RData")
## method 2.
save(x, file='Example.Rdata')
①
.RData
file: Stores all variables in R
⒂ Input file is .RDS
or .Rds
# when loading .RDS file
## method 1.
load(file="Example.RDS")
## method 2.
pbmc <- readRDS('Example.RDS')
①
.RDS
file: Stores all variables or a single variable in R
⒃ Input file is .h5
① R : The following code reads a
.h5
file as a sparse matrix
library(Seurat)
data = Read10X_h5("data.h5")
② Python : The following code reads a
.h5
file as AnnData
import scanpy as sc
data = sc.read_10x_h5('data.h5')
⒄ Input file is .h5ad
① R : The following code reads a
.h5ad
file into an Anndata object (newly established in March 2023)
library(anndata)
data = read_h5ad("data.h5ad")
② Python : The following code reads a
.h5ad
file into an Anndata object
pbmc = scanpy.read_h5ad("data.h5ad")
③ Save the Seurat object pbmc3k as
h5ad
(reference).
library(Seurat)
library(SeuratData)
library(SeuratDisk)
SaveH5Seurat(pbmc3k, filename = "~/Downloads/pbmc3k.h5Seurat")
Convert("~/Downloads/pbmc3k.h5Seurat", dest = "h5ad")
⒅ Input file is .rda
load("/Users/parkjeongbin/Downloads/ifnb.SeuratData/data/ifnb.rda")
①
.rda
file: Stores a single variable in R. Unlike.rds
, variables are loaded using load function without a return type
② Note that the ipynb.rda file was installed in RStudio
library(SeuratData)
InstallData("ifnb")
⒆ Input file is .Rmd
①
.ipynb
and.Rmd
files are mutually convertible
② Code to create
.ipynb
file from.Rmd
file (Reference)
devtools::install_github("mkearney/rmd2jupyter")
library(rmd2jupyter)
rmd2jupyter("MyFILE.Rmd")
⒇ Input file is .Robj
① Input possible using
load()
function
load('Data.Robj')
⒇ Input file is .cel
(ref1, ref2)
①
.cel
file is a raw data file extension created by Affymetrix
② Code
b <- ReadAffy(filenames='/Users/parkjeongbin/Desktop/614615111406.E02.CEL')
#Extract the perfect match probe-level intensities
dim(pm(b)) #[1] 247965 1
#Phenotypic data : sample information 6X17
dim(pData(b)) #[1] 1 1
#Plateform used for gene information
annotation(b) #[1] "hgu133a"
#Retrieving other types of information
## Use the cfdName() method on the AffyBatch
cdfName(b) #[1] "HG-U133A"
## Use the featureNames() method on the AffyBatch
featureNames(b)
## Use the length() method to count the number of items in a vector
length(featureNames(b)) #[1] 22283
## Use the length() method to count the number of items in the vector containing names of all probes
length(probeNames(b)) #[1] 247965
#Normalize
data.rma = rma(b)
data.matrix = exprs(data.rma)
dim(data.matrix) #[1] 22283 1
#Convert Affymetrix probe ID to ensembl_gene_id, gene_name
## https://www.biostars.org/p/328065/#328328
## https://www.biostars.org/p/332461/#332474
BiocManager::install("biomaRt")
library(biomaRt)
dat<-rownames(data.matrix)
require("biomaRt")
mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset("hsapiens_gene_ensembl", mart)
annotLookup <- getBM(
mart = mart,
attributes = c(
"affy_hg_u133_plus_2",
# "ensembl_gene_id",
# "gene_biotype",
"external_gene_name"),
filter = "affy_hg_u133_plus_2",
values = dat,
uniqueRows=TRUE)
○ The characteristics of the above example are as follows
○ Read single file and have only 1 sample: but can also read the folder itself
○ Using an assay called HG-U133A
○ 22283 features: feature means gene, probe_id
○ gene expression is scaled up to 14
⒇ When source file is .ipynb
① Jupyter notebook file contains code and execution process recorded
②
.ipynb
files can be generated not only in Python but also in R
③ How to check contents : Anaconda, Google Colab, GitHub
⒇ When source file is .Rmd
① Package to convert
Rmd
files toipynb
files
### conver Rmd into ipynb ###
devtools::install_github("mkearney/rmd2jupyter")
library(rmd2jupyter)
rmd2jupyter("filename.Rmd")
Input: November 25, 2019 20:22
Modification: December 28, 2022 23:39