Korean, Edit

Determining Cell Types with Seurat

Recommended article : 【Bioinformatics】 Bioinformatics Analysis Table of Contents


1. Official Tutorial

2. Customization

3. Trouble-shooting


a. Cell Type Classification Pipeline

b. Determining Cell Types with scater

c. Determining Cell Types with scanpy


※ This document is from before the recent upgrade to Seurat v5.0.

※ Seurat is named after Georges Seurat.


1. Official Tutorial (R STUDIO)


library(dplyr)
install.packages("Seurat")
library(Seurat)

### import input_data(cellranger count output과 동일) ###
pbmc.data <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")

### setup the seurat object ###
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)

### Preprocessing(QC) ###
### Low quality cell, mitochondria genome percent check
pbmc[["percent.mt"]] <- PercentageFeatureSet(object = pbmc, pattern = "^MT-")

### Quality Control
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 15)

### normalizing the data ###
pbmc <- NormalizeData(object = pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

### identification of highly variable features ###
pbmc <- FindVariableFeatures(object = pbmc, selection.method = "vst", nfeatures = 2000)

### scaling data ###
all.genes <- rownames(x = pbmc)
pbmc <- ScaleData(object = pbmc, features = all.genes)

### perform linear dimensional reduction ###
pbmc <- RunPCA(object = pbmc, features = VariableFeatures(object = pbmc))

### determine the "dimensionality" of the dataset
pbmc <- JackStraw(object = pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(object = pbmc, dims = 1:20)

### cluster the cells ###
pbmc <- FindNeighbors(object = pbmc, dims = 1:10)
pbmc <- FindClusters(object = pbmc, resolution = 0.5)

### Run non-linear dimensional reduction(tSNE) ###
pbmc <- RunUMAP(object = pbmc, dims = 1:10)
DimPlot(object = pbmc, reduction = "umap")
pbmc <- RunTSNE(object = pbmc, dims = 1:10)
DimPlot(object = pbmc, reduction = "tsne")

### Finding differentially expressed features(biomarkers) ###
markers <- FindAllMarkers(object = pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)

### Maybe you need this kind of data ###
umap <- DimPlot(object = pbmc, reduction = "umap")
tsne <- DimPlot(object = pbmc, reduction = "tsne")
names(umap)
# [1] "data"        "layers"      "scales"      "mapping"     "theme"      
# [6] "coordinates" "facet"       "plot_env"    "labels"      "guides"     
write.csv(markers, "markers.csv")
write.csv(umap$data, "umap_data.csv")
write.csv(tsne$data, "tsne_data.csv")

image

Figure. 1. Clusters based on cell types represented by DimPlot


Example Files

⑵ scRNA-seq data represents read count divided by total transcript and multiplied by 10,000

CreateSeuratObject

min.features : A smaller value results in fewer missing data

⑷ The top 2,000 genes with the highest expression are selected using the nfeatures parameter in the FindVariableFeatures function

nfeatures : A larger value enhances clustering. Generally, having more than 2,000 features doesn’t significantly impact clustering

NormalizeData

① scRNA-seq data is scaled using log-normalization

② Subsequently, the data is scaled to z-scores through regression of total cellular read count and mitochondrial read count

ScaleData

① pbmc@assays$RNA@scale.data is created from pbmc@assays$RNA@data

Operation 1: Adjust the zero point of normalized expression stored in pbmc@assays$RNA@data

○ The minimum value of pbmc@assays$RNA@scale.data is usually negative

Operation 2: Values exceeding 10 after scaling are set to 10

○ Without this assignment operation, the Spearman correlation between pbmc@assays$RNA@data and pbmc@assays$RNA@scale.data would be 1: preserving rank

④ True gene expression is better observed through pbmc@assays$RNA@scale.data

○ The 0-10 scale commonly used in bulk-RNA seq such as TCGA (The Cancer Genome Atlas) is inferred to be based on scale.data

RunPCA

① PCA (principal component analysis) is performed for dimensionality reduction: to enhance clustering efficiency

② 50 principal components are utilized for clustering

JackStraw

① Parameters specified on the reference guide can be omitted if given

num.replicate : Having a value of 1 doesn’t affect subsequent steps

ScoreJackStraw

① Parameters specified on the reference guide can be omitted if given

⑽ Clustering is performed through FindClusters, usually with a resolution of 0.5

resolution : A larger value categorizes cell types more diversely

FindClusters employs a graph-based approach

RunUMAP

dims : Can slightly alter the shape of the map

⑿ Wilcoxon rank sum test is used in FindAllMarkers

① A marker gene requires expression in over 25% of cells within a cluster and a fold change greater than 0.25 (log scale)

DotPlot

Link

② Although not shown in the code above, this is frequently used

FeaturePlot

Link

② Although not shown in the code above, this is frequently used



2. Customization

⑴ Overview: If you want to analyze your own data with Seurat

⑵ In the filtered_gene_bc_matrices/hg19/ directory, there are three files

barcodes.tsv : Barcodes for each single cell

genes.tsv : Official names and aliases of human genes

matrix.mtx : Sparse matrix representing gene expression for each barcode and gene

⑶ 1st. Modifying matrix.mtx

① 1st - The matrix with only 1st - 1st data will be given (with rownames and colnames)

② 1st - 2nd - Assign the first value as the position of the non-zero gene indicated by the data in matrix.mtx

③ 1st - 3rd - Assign the second value as the barcode of the single cell indicated by the non-zero data

④ 1st - 4th - Assign the third value as the expression level indicated by the non-zero data

⑤ 1st - 5th - Combine the first, second, and third values into one row to construct the entire matrix (sparse matrix structure)

⑥ 1st - 6th - Save the entire matrix as a text file and convert it to a .mtx file

⑦ 1st - 7th - Replace the original matrix.mtx file and add the first three lines as follows


image

Figure. 2. Format of matrix.mtx


%%MatrixMarket matrix coordinate real general % 32738 1047 2572919

○ 32,738 : Number of human genes, no need to change

○ 1,047 is the number of barcodes, should be adjusted according to the experiment

○ 2,572,919 is the number of non-zero data points, should be adjusted according to the experiment

⑧ Reading matrix.mtx.gz is recommended over matrix.mtx

⑷ 2nd. Modifying barcodes.tsv : Assign barcodes randomly to the number of single cells in the experimental data (without overlap)

① If the barcode order in matrix.mtx exceeds the range of barcode count in barcodes.tsv, the following error message is displayed:

readMM(): column values ‘j’ are not in 1:nc

② Reading barcodes.tsv.gz is recommended over barcodes.tsv

⑸ 3rd. No need to modify genes.tsv

① As Seurat versions update, genes.tsv should be renamed to features.tsv

② Reading genes.tsv.gz is recommended over genes.tsv



3. Trouble-shooting

⑴ “File is too large to open in Notepad.”

① Not possible to add content when text file starts approaching 1 GB

② WordPad is slow and lacks desired features

③ Use programs like gVim 8.0 (Windows) or sublime txt (Mac) to edit large text files

⑵ “Directory provided does not exist”


pbmc.data <- Read10X(data.dir = "../data/pbmc3k/filtered_gene_bc_matrices/hg19/")
# Error in Read10X(data.dir = "../data/pbmc3k/filtered_gene_bc_matrices/hg19/") : 
#   Directory provided does not exist


① Directly unzip files and change the directory address of the generated document

② Files are available on various sites

③ Modify the code as follows


pbmc.data <- Read10X(data.dir = "C:/Users/sun/Desktop/filtered_gene_bc_matrices/hg19/")


⑶ Barcode file missing. Expecting barcodes.tsv.gz

① Recent Seurat pipelines require barcodes.tsv.gz, features.tsv.gz, matrix.mtx.gz instead of barcodes.tsv, genes.tsv, matrix.mtx

② If dealing with datasets tailored to older Seurat versions, use ReadMtx instead of Read10X

⑷ “Error: package or namespace load failed for ‘Seurat’ in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()) versionCheck = vI[[j]]): No package called ‘multtest’”


install.packages("BiocManager")
BiocManager::install('multtest')
install.packages("Seurat")
library(Seurat)


⑸ “Error: cannot allocate vector of size 13.3 GB”

① Cause: Process exceeding memory limit

○ Typically demanding memory during pbmc ← ScaleData(object = pbmc, features = all.genes)

○ Memory is also required in pbmc ← JackStraw(object = pbmc, num.replicate = 100), changing num.replicate to 1 solves it

○ Memory.limit cannot exceed RAM memory

○ Task Manager → Performance → Memory

○ Check total memory capacity

○ Seurat generally uses 7 GB and requires additional memory based on data size (e.g., 13.3 GB)

○ Buying additional RAM can help after checking used slots

○ (Comment) For data analysis, 32 GB RAM seems necessary

② Solution (R 4.1.x version or lower)


memory.size(max = TRUE)  
memory.size(max = FALSE) 
memory.limit(size = NA)  
memory.limit(size = 50000)  


③ Solution (R 4.2.x version or higher)

○ “memory.limit() is no longer supported” error

○ Increase R’s maximum memory limit as follows (Reference)


library(usethis) 
usethis::edit_r_environ()

# Then, type "R_MAX_VSIZE=100Gb" and save .Renviron file.


○ However, in R 4.2.x and above, memory is automatically increased, so no need to worry (Reference)

⑹ “Error: invalid dimnames given for “dgTMatrix” object in dimnamesGets(x, value)”

① Problem: Incorrectly sorted values in genes.tsv causing issues in Read10X

② Ensuring all tuples are in 3 columns will allow smooth progression to the next step

⑺ An old seurat object 23341 genes across 624 samples

① Issue: Old seurat object is difficult to analyze or observe

② Solution: Use UpdateSeuratObject


load('~/Downloads/droplet_Heart_and_Aorta_seurat_tiss.Robj')
pbmc = UpdateSeuratObject(tiss)
DimPlot(pbmc)


⑻ How to save matrix.mtx, barcodes.tsv, and genes.tsv files from a Seurat object. (R)


library(Seurat)

counts <- seurat_obj@assays$RNA@counts 
barcodes <- rownames(seurat_obj@meta.data)
features <- rownames(counts)

library(Matrix)
writeMM(counts, file = "~/Downloads/matrix.mtx")

write.table(barcodes, file = "~/Downloads/barcodes.tsv", row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\t")

gene_ids <- features
gene_names <- features
features_df <- data.frame(gene_ids, gene_names)
write.table(features_df, file = "~/Downloads/features.tsv", row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\t")


⑼ Creating an h5ad file when you have matrix.mtx, barcodes.tsv, and features.tsv (or genes.tsv) (Python)


import os
import glob
from scipy.io import mmread
import pandas as pd
import scanpy as sc

# Set the target directory here
target_directory = '/path/to/your/data.h5ad'

# Iterate over all .mtx files in the target directory
for mtx_file in glob.glob(os.path.join(target_directory, '*matrix.mtx')):
    prefix = mtx_file[:-10]  # Remove 'matrix.mtx' to get the prefix
    
    # Construct file names for barcodes and genes
    barcodes_file = prefix + 'barcodes.tsv'
    features_file = prefix + 'features.tsv'
    genes_file = prefix + 'features.tsv' if os.path.exists(prefix + 'features.tsv') else prefix + 'genes.tsv'
    
    # Check if all files exist
    if os.path.exists(mtx_file) and os.path.exists(barcodes_file) and os.path.exists(genes_file):
        # Load the data
        matrix = mmread(mtx_file).T.tocsr()  # Transpose to have genes as columns
        genes = pd.read_csv(genes_file, header=None, sep='\t')
        barcodes = pd.read_csv(barcodes_file, header=None, sep='\t')

        # Create an AnnData object
        adata = sc.AnnData(X=matrix, obs={'barcodes': barcodes[0].values}, var={'gene_ids': genes[0].values, 'gene_symbols': genes[1].values})
        
        # Output file name
        h5ad_file = prefix + '.h5ad'
        
        # Save the AnnData object
        adata.write(h5ad_file)
        
        # Delete the original files
        os.remove(mtx_file)
        os.remove(barcodes_file)
        os.remove(genes_file)

print("Conversion complete. Original files have been deleted.")


⑽ When there is a tissue_dir directory containing matrix.mtx.gz, barcodes.tsv, features.tsv, and the spatial folder, the code to read Visium data in Python


import scanpy as sc
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.image import imread
import numpy as np
import cv2
import math

from pathlib import Path, PurePath
from typing import Union, Dict, Optional, Tuple, BinaryIO
import json

def to_spatial(adata, path, library_id = 'TNBC1'):
    
    adata.uns["spatial"][library_id] = dict()
    path = Path(path)
    
    files = dict(
        tissue_positions_file=path / 'spatial/tissue_positions_list.csv',
        scalefactors_json_file=path /  'spatial/scalefactors_json.json',
        hires_image=path / 'spatial/tissue_hires_image.png',
        lowres_image=path / 'spatial/tissue_lowres_image.png',
    )

    # check if files exists, continue if images are missing
    for f in files.values():
        if not f.exists():
            if any(x in str(f) for x in ["hires_image", "lowres_image"]):
                logg.warning(
                    f"You seem to be missing an image file.\n"
                    f"Could not find '{f}'."
                )
            else:
                raise OSError(f"Could not find '{f}'")

    adata.uns["spatial"][library_id]['images'] = dict()
    for res in ['hires', 'lowres']:
        try:
            adata.uns["spatial"][library_id]['images'][res] = imread(
                str(files[f'{res}_image'])
            )
        except Exception:
            raise OSError(f"Could not find '{res}_image'")

    # read json scalefactors
    adata.uns["spatial"][library_id]['scalefactors'] = json.loads(
        files['scalefactors_json_file'].read_bytes()
    )

    #adata.uns["spatial"][library_id]["metadata"] = {
    #    k: (str(attrs[k], "utf-8") if isinstance(attrs[k], bytes) else attrs[k])
    #    for k in ("chemistry_description", "software_version")
    #    if k in attrs
    #}

    # read coordinates
    positions = pd.read_csv(files['tissue_positions_file'], header=None)
    positions.columns = [
        'barcode',
        'in_tissue',
        'array_row',
        'array_col',
        'pxl_col_in_fullres',
        'pxl_row_in_fullres',
    ]
    positions.index = positions['barcode']

    adata.obs = adata.obs.join(positions, how="left")
    
    adata.obsm['spatial'] = adata.obs[
        ['pxl_row_in_fullres', 'pxl_col_in_fullres']
    ].to_numpy()
    
    adata.obs.drop(
        columns=['barcode', 'pxl_row_in_fullres', 'pxl_col_in_fullres'],
        inplace=True,
    )

    return adata


def load_adata(tissue_dir):
    adata1 = sc.read_mtx(tissue_dir + 'matrix.mtx') 
    adata1 = adata1.transpose() 
    cellname = pd.read_csv(tissue_dir + 'barcodes.tsv', sep="\t", header=None, index_col=1, names=['idx', 'barcode']) 
    cellname.index = cellname['idx']
    featurename = pd.read_csv(tissue_dir + 'features.tsv', sep='\t', header=None, index_col=1, names=['gene_ids', 'feature_types'])
    adata1.var = featurename
    adata1.obs = cellname
    adata1.uns["spatial"] = dict()
    adata1 = to_spatial(adata1, tissue_dir, library_id='Library_1')
    adata1 = adata1[adata1.obs.in_tissue == 1,:]
    adata1.var_names_make_unique()
    adata1.obs['unnormalized_counts'] = adata1.X.sum(axis=1).A1
    sc.pp.normalize_total(adata1, inplace=True)
    adata1.obs['normalized_counts'] = adata1.X.sum(axis=1).A1
    sc.pp.log1p(adata1)
    sc.pp.highly_variable_genes(adata1, flavor="seurat", n_top_genes=2000)

    return adata1


⑾ In case matrix.mtx.gz, barcodes.tsv, or features.tsv are corrupted:

① If the matrix.mtx.gz file is corrupted, read it in the R environment using readMM and restore the corrupted file using writeMM.

② For barcodes.tsv and features.tsv, read them in the R environment using read.table and restore the corrupted files using write.table.

⑿ The same code can produce different or similar results depending on the computer

① Reason: Initial conditions, i.e., seed, may slightly differ.

⒀ Input file is .rds


# when loading .rds file
pbmc <- readRDS("Example.rds")

# when saving .rds file
saveRDS(pbmc, file = "Example.rds")

## You can even plot pbmc, when the .rds file is fully processed file
DimPlot(object = pbmc, reduction = "umap")


.rds file: Stores a single variable in R

⒁ Input file is .RData or .Rdata


# when loading .RData file
load(file="Example.RData")


# when saving .RData file

## method 1.
save.image(file="Example.RData")

## method 2.
save(x, file='Example.Rdata')


.RData file: Stores all variables in R

⒂ Input file is .RDS or .Rds


# when loading .RDS file

## method 1.
load(file="Example.RDS") 

## method 2.
pbmc <- readRDS('Example.RDS')


.RDS file: Stores all variables or a single variable in R

⒃ Input file is .h5

① R : The following code reads a .h5 file as a sparse matrix


library(Seurat)
data = Read10X_h5("data.h5")


② Python : The following code reads a .h5 file as AnnData


import scanpy as sc
data = sc.read_10x_h5('data.h5')


⒄ Input file is .h5ad

① R : The following code reads a .h5ad file into an Anndata object (newly established in March 2023)


library(anndata)
data = read_h5ad("data.h5ad")


② Python : The following code reads a .h5ad file into an Anndata object


pbmc = scanpy.read_h5ad("data.h5ad")


③ Save the Seurat object pbmc3k as h5ad (reference).


library(Seurat)
library(SeuratData)
library(SeuratDisk)

SaveH5Seurat(pbmc3k, filename = "~/Downloads/pbmc3k.h5Seurat")
Convert("~/Downloads/pbmc3k.h5Seurat", dest = "h5ad")


⒅ Input file is .rda


load("/Users/parkjeongbin/Downloads/ifnb.SeuratData/data/ifnb.rda")


.rda file: Stores a single variable in R. Unlike .rds, variables are loaded using load function without a return type

② Note that the ipynb.rda file was installed in RStudio

library(SeuratData)

InstallData("ifnb")

⒆ Input file is .Rmd

.ipynb and .Rmd files are mutually convertible

② Code to create .ipynb file from .Rmd file (Reference)

devtools::install_github("mkearney/rmd2jupyter")
library(rmd2jupyter)
rmd2jupyter("MyFILE.Rmd")

⒇ Input file is .Robj

① Input possible using load() function

Reference

load('Data.Robj')

⒇ Input file is .cel (ref1, ref2)

.cel file is a raw data file extension created by Affymetrix

② Code

b <- ReadAffy(filenames='/Users/parkjeongbin/Desktop/614615111406.E02.CEL')

#Extract the perfect match probe-level intensities
dim(pm(b)) #[1] 247965      1

#Phenotypic data : sample information 6X17
dim(pData(b)) #[1] 1      1

#Plateform used for gene information
annotation(b) #[1] "hgu133a"

#Retrieving other types of information
## Use the cfdName() method on the AffyBatch
cdfName(b) #[1] "HG-U133A"
## Use the featureNames() method on the AffyBatch
featureNames(b)
## Use the length() method to count the number of items in a vector
length(featureNames(b)) #[1] 22283
## Use the length() method to count the number of items in the vector containing names of all probes
length(probeNames(b)) #[1] 247965

#Normalize
data.rma = rma(b)
data.matrix = exprs(data.rma)
dim(data.matrix) #[1] 22283      1

#Convert Affymetrix probe ID to ensembl_gene_id, gene_name
## https://www.biostars.org/p/328065/#328328
## https://www.biostars.org/p/332461/#332474
BiocManager::install("biomaRt")
library(biomaRt)

dat<-rownames(data.matrix)
require("biomaRt")
mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset("hsapiens_gene_ensembl", mart)
annotLookup <- getBM(
  mart = mart,
  attributes = c(
    "affy_hg_u133_plus_2",
#    "ensembl_gene_id",
#    "gene_biotype",
    "external_gene_name"),
  filter = "affy_hg_u133_plus_2",
  values = dat,
  uniqueRows=TRUE)

○ The characteristics of the above example are as follows

○ Read single file and have only 1 sample: but can also read the folder itself

○ Using an assay called HG-U133A

○ 22283 features: feature means gene, probe_id

○ gene expression is scaled up to 14

⒇ When source file is .ipynb

Jupyter notebook file contains code and execution process recorded

.ipynb files can be generated not only in Python but also in R

③ How to check contents : Anaconda, Google Colab, GitHub

⒇ When source file is .Rmd

① Package to convert Rmd files to ipynb files

### conver Rmd into ipynb ###
devtools::install_github("mkearney/rmd2jupyter")
library(rmd2jupyter)
rmd2jupyter("filename.Rmd")



Input: November 25, 2019 20:22

Modification: December 28, 2022 23:39

results matching ""

    No results matching ""