Korean, Edit

Transcriptomics Pipeline

Recommended post: 【Bioinformatics】 Bioinformatics Analysis Table of Contents


1. QC 1. Tissue QC

2. QC 2. Data QC

3. QC 3. Filtering

4. QC 4. Alignment

5. QC 5. Normalization

6. QC 6. Batch Effect

7. Common 1. Clustering

8. Common 2. Differentially Expressed Gene (DEG) Analysis

9. Common 3. Gene Set Analysis

10. Common 4. Gene Interaction Analysis

11. Common 5. Cell Type Mapping Analysis

12. Advanced 1. Alternative Splicing Analysis (AS Analysis)

13. Advanced 2. Trajectory Analysis

14. Advanced 3. Epigenomics Analysis

15. Advanced 4. Special Transcriptpmics Analysis

16. Advanced 5. Database Utilization


a. Sequencing Technologies

b. Seurat Pipeline



image

Figure 1. Analysis Guides


1. QC 1. Tissue QC

⑴ Definition: Metric for assessing tissue quality

Type 1: RIN (RNA Integrity Number)

① Measured by Agilent 2100 Bioanalyzer

② RIN = 10: Intact RNA

③ RIN = 1: Completely degraded RNA

④ RIN > 7: Generally suitable quality level for RNA-seq

Type 2: DV200: Percentage of fragments around 200 nt in FFPE tissues, indicating RNA fragmentation



2. QC 2. Data QC

⑴ Definition: Metric for assessing data quality

Type 1: QC metrics for the data itself: Used for external validity confirmation by manual inspection or comparing with other datasets

① QC metrics

dUTP method: “_1.fastq” represents first strand (anti-sense), “_2.fastq” represents second strand (sense; original RNA sequence)

○ High non-coding RNA ratio and high duplication indicate poor RNA quality

○ High GC content: Suggests potential rRNA contamination; requires filtering of 5S, 18S, 28S rRNA

○ Low GC content: Indicates potential issues with reverse transcription

○ DNA-seq generally removes duplicates, while RNA-seq typically doesn’t

○ RNA quality is low if unique molecule count is below 10%

○ Small fragments: Adaptor binding starts, requiring adaptor trimming

○ Exonic portion: 50-70% in poly-A(+) RNA-seq, reduced in rRNA(-) RNA-seq

② Other datasets: 10x Genomics, GEO, ZENODO, etc.

Method 1: FastQC program

Method 2: conda Fastqc command (Linux)

Method 3: SRA (Sequence Reads Archive) Toolkit with fastqc command; example output provided


sudo apt install fastqc
cd sratoolkit.3.0.5-ubuntu64/
cd bin
fastqc DRR016938.fastq


Example file


Method 4: SnakeMake: Integrated pipeline with QC functionality

Snakefile: A Snakemake script based on Python. The filename itself is Snakefile.


# Snakefile

# Define the result file
rule all:
    input:
        "results/processed_data.tsv"

# Data processing rule
rule process_data:
    input:
        "data/raw_data.tsv"
    output:
        "results/processed_data.tsv"
    shell:
        """
        cat {input} | awk -F'\t' '' > {output}
        """


config.yaml (optional): Configuration for the Snakemake workflow

requirements.txt (optional): Package dependencies

○ Input file

○ Output file

⑦ For Hi-C sequencing, there is QuASAR-QC.

Trouble-shooting

Type 2: Comparison or reproducibility metrics between samples: Used for internal validity assessment

Purpose 1: Evaluate sample quality by examining ranks of two correlated variables within a single sample

○ Example: Investigate alignment of expression levels of two known similar genes

○ Example: Examine if two genes known to be similar appear in the same cluster

Purpose 2: Often used to examine correspondence between pairs of samples

Purpose 3: Investigate correlation between two variables with different data distribution characteristics

○ Concept somewhat distinct from QC analysis

○ Example: Correlation coefficient between gene A expression in scRNA-seq and A expression in ST (Spatial Transcriptomics)

Method 1: Pearson correlation coefficient


image


○ Definition: For standard deviations σx and σy of X and Y,

○ Calculation in RStudio

○ cor(x, y)

○ cor(x, y, method = “pearson”)

○ cor.test(x, y)

○ cor.test(x, y, method = “pearson”)

Method 2: Spearman correlation coefficient


image


○ Definition: Defined for x’ = rank(x) and y’ = rank(x)

○ Calculation in RStudio

○ cor(x, y, method = “spearman”)

○ cor.test(x, y, method = “spearman”)

Method 3: Kendall correlation coefficient


image


○ Definition: Defined using concordant and discordant pairs

○ Steps 1 and 2: Sorting y values in ascending order for x values

○ Step 3: Counting concordant and discordant pairs

○ Step 4: Correlation coefficient definition

○ nc: total number of concordant pairs

○ nd: total number of discordant pairs

○ n: size of x and y

○ Calculation in RStudio

○ cor(x, y, method = “kendall”)

○ cor.test(x, y, method = “kendall”)

⑦ For Hi-C sequencing, there are HiCRep, GenomeDISCO, HiC-Spector, and QuASAR-Rep.



3. QC 3. Filtering

Type 1: Filtering during data processing

① Removal of inappropriate reads from sequencing data to generate trimmed data

Example 1: Trimmomatic

Type 2: Filtering in downstream analysis

2-1. Gene filtering: Excluding genes with low expression to avoid capturing non-informative or unrealistically DEG genes

2-2. Barcode filtering

Case 1: Low RNA expression (e.g., poor QC results)

Case 2: High RNA expression (e.g., contamination due to RNA diffusion)

Case 3. Doublet removal: scdDblFinder (R-based), hybrid score of scds (R-based), doubletCells of scran (R-based)

③ Can be implemented using the subset function in the R Seurat pipeline, matrix operations in Python scanpy, etc.



4. QC 4. Alignment

⑴ Definition: Process of finding the location of a specific sequencing read within a genome or transcriptome

Step 1: Raw data preparation: File extension is .Fastq

Type 1: Single-end sequencing (SES): Sequencing using one side of the adapter

Type 2: Paired-end sequencing (PES): Sequencing using both sides of the adapter

dUTP method standard: “_1.fastq” represents first strand (anti-sense), “_2.fastq” represents second strand (sense; original RNA sequence)

○ First, perform sequencing through one adapter (obtain Read1), and then perform sequencing through the opposite adapter (obtain Read2).

○ Read1 and Read2 from the same DNA fragment can be easily matched because they come from the same cluster.

○ Advantages: Higher accuracy (due to comparison between Read1 and Read2), easy extraction of DNA mutations, easy analysis of repetitive sequences, easy mapping between different species.

○ Disadvantages: Higher cost, requires more steps than SES.

③ Long-read sequencing


  Short-Read Seq Long-Read Seq
Release Year Early 2000s Mid 2010s
Average Read Length 150-300 bp 5,000-10,000 bp
Accuracy 99.9% 95-99%

Table 1. The comparison of short-read seq and long-read seq


image

Figure 2. The difference between short-read seq and long-read seq


○ There are no sequencing gaps, allowing for the following analyses:

Advantage 1: Alternative Splicing Analysis: Identification of alternative splicing events and isoforms is possible.

Advantage 2: Copy Number Variation (CNV) Analysis: For example, the number of repeat sequences is critical in Huntington’s Disease.

Advantage 3: Facilitates the integration of epigenetics and transcriptomics.

Type 1: Pacific Biosciences SMRT (Single Molecule Real-Time) sequencing: The average read length is ~20 kb.

Type 2: Oxford Nanopore Sequencing: The average read length is ~100 kb.

Step 2. Preprocessing: The process of preprocessing raw sequencing data to extract valuable summarized information.

Process 1. alignment


image

Figure 3. Alignment process


○ Definition: The process of determining how two or more sequences match each other. At this stage, the concept of a reference is not yet necessary.

○ Purpose:

○ To confirm the similarity between two sequences for use in mapping and assembly processes.

○ To identify mutations such as insertions, deletions, and substitutions.

Type 1: BLAST, BLAT: Released in 1997 and 2002, respectively. Based on k-mer hashing.

Type 2. Alignment for bulk RNA-seq

2-1. STAR: Most commonly used alignment tool. (ref)


### STEP 1. Installation
# Method 1 - reference : https://github.com/alexdobin/STAR 
wget https://github.com/alexdobin/STAR/archive/2.7.10b.tar.gz
tar -xzf 2.7.10b.tar.gz
cd STAR-2.7.10b
cd source
make STAR

# Method 2 
sudo apt install rna-star


### STEP 2. STAR genomeGenerate
STAR --runMode genomeGenerate \
     --runThreadN NumberOfThreads \
     --genomeDir /path/to/your/GenomeDir \
     --genomeFastaFiles /path/to/combined_genome.fa \
     --sjdbGTFfile /path/to/combined_genome.gtf 


### STEP 3. STAR run
STAR --genomeDir /path/to/your/GenomeDir \
     --readFilesIn /path/to/your/read1.fastq /path/to/your/read2.fastq \
		 --readFilesCommand zcat \
     --runThreadN NumberOfThreads \
     --outFileNamePrefix /path/to/your/outputPrefix \
     --outSAMtype BAM SortedByCoordinate

# Example     
STAR --genomeDir ./reference --readFilesIn read1.fastq read2.fastq --runThreadN 4 --outFileNamePrefix ./SaveDir --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate


2-2. Hisat2: Implemented the GFM (graph FM index) based on graph’s BWT extension, the first of its kind.

2-3. Bowtie, Bowtie2


## STEP 1. install

# Debian (e.g., Ubuntu)
sudo apt-get update
sudo apt-get install bowtie2

# Red Hat (e.g., Fedora, CentOS)
sudo yum install bowtie2

# Conda
conda install -c bioconda bowtie2

# macOS 
brew install bowtie2


## STEP 2. Genome Generation
bowtie2-build hg38.fasta human_reference


2-4. Tophat, Tophat2

2-5. BWA (Burrows-Wheeler Aligner)

Type 3. Pseudo-alignment for bulk RNA-seq

○ Pseudo-alignment simplifies some steps of alignment to handle the time-consuming process more quickly.

○ Pseudo-alignment is less accurate than alignment but faster.

3-1. Kallisto: paper, manual

3-2. Sleuth: blogpost, tutorial

3-3. Salmon: preprint, manual

Type 4. Alignment and counting for scRNA-seq

4-1. CellRanger: Uses STAR alignment.

Type 5. Alignment and counting for ST data

5-1. SpaceRanger: Limited to Visium. Uses STAR alignment but considers spatial barcoding.

5-2. SpaceMake : Integrated workflow to process various spatial transcriptomics data.

5-3. Numbat

Type 6. splice-aware aligner: Aligner that considers RNA-specific events like splicing during alignment.

Process 2. Mapping

○ Definition: Storing sequences in a hash table and using this table to map similar sections between different sequences.

○ Background: All-vs-all sequence comparison consumes astronomical amounts of time, thus it’s necessary to find short, representative hash values for each sequence.

○ General Process:

Step 1: Divide the given sequence into several sliding windows.

Step 2: Search for all contiguous sequences of k elements within each sliding window.

Step 3: Among the k contiguous sequences, the lexicographically smallest sequence is the minimizer for that window.

Step 4: Apply a quaternary number system to the minimizer (i.e., A = 0, C = 1, T = 2, G = 3) to translate the minimizer into a specific value.

Step 5: Apply a hash function to the minimizer to generate a hash value, aiming to alleviate base distribution asymmetry and increase database efficiency.

Step 6: Store each minimizer from every window in the corresponding element of the hash table based on its hash value.

Step 7: Generate a set of minimizers from the query sequence (the sequence of interest).

Step 8: Generate a set of minimizers from the target sequence (the reference sequence).

Step 9: Search for minimizer hits: compare the minimizer sets of the query and target sequences to explore similar sections between the two sequences.

○ Note on Hash Function: The hash function does not necessarily need to be reversible, but hash collision pairs should not be easily discovered.

○ Advantages of the Mapping Process: Speeds up comparisons between different sequences and ensures the pipeline operates robustly even in the presence of noise or many SNPs.

Type 1: BLASR

Type 2. DALIGNER

Type 3: MHAP

Type 4: GraphMap

Type 5: minimap, minimap2 (mapping algorithms designed for long-read sequencing)

Process 3: Error-Correction

○ Definition: Correcting errors from the mapping process. Often included within the assembly process.

Type 1: pbdagcon

Type 2: falcon_sense

Type 3: nanocorrect

Process 4. assembly


image

Figure 2. Assembly process


Definition: The process of finding the corresponding reference genome position for short DNA/RNA fragments obtained due to technical limitations.

○ General Process:

Step 1: Organize each sequence using the mapping results.

Step 2: Create an assembly graph: Use the distances in nucleotide sequences from multiple mappings to form corresponding edges.

○ Query sequence: A B CD E

○ Target sequence 1: B C ED A (because the order of A, B, C, D is incorrect)

○ Target sequence 2: A BC ←(remote)→ DE (because the distance between C and D is too far)

○ Target sequence 3: A B C D

○ In this case, it can be said that the ABCD of the query sequence corresponds to the ABCD of target sequence 3.

Step 3: Clean the assembly graph: If the corresponding sections are P → Q → R, it can be simplified to P → R.

Step 4: Generate unitigs.

○ During this process, information such as position and mismatch is generated.

○ The result of the assembly process generates a .bam file.

Type 1: wgs-assembler

Type 2: Falcon

Type 3: ra-integrate

Type 4: miniasm (an assembly algorithm for long-read sequencing)

Type 5: Numbat

Type 6. Velvet: Used to know the coordinate and feature of the breakpoint junction sequence.

Process 5. consensus polish

Type 1. Quiver

Type 2. nanopolish

Application 1. reference matching

○ Definition: Process of determining which reference a read comes from when using multiple references of different species.

○ Special algorithms are used to correct errors due to inter-species homology.

○ Typically, multiple .bam files mapped to different references are used to find more accurate references for each transcript.

○ This discussion implies that to accurately and comprehensively map RNAs from a given tissue, one should consider one or more optimal combinations of references.

Type 1. Xenograft alignment: Matching references for each transcript using .bam files from graft and host references.

1-1. XenofiltR: Takes .bam files from host and graft references. Generates .bam and .bam.bai files with only graft reads remaining.

1-2. BAMCMP: Separates graft and host reads in xenograft data. Divides into graft-only, host-only, ambiguous, unmapped classes.

1-3. disambiguate

1-4. Xenomake : Solely for spatial transcriptomics.

Type 2. Using multiple references in pseudo-alignment (ref)

○ Simple approach of constructing integrated references like a single species’ reference results in lower accuracy.

Type 3. Microbiome Alignment

3-1. Pathseq: Performs filtering, alignment, and abundance estimation in mixed human-microbe metagenomics data using GATK (Genome Analysis Toolkit).

3-2. Kraken

Application 2. Workflow Build

Type 1. SnakeMake : A tool for creating custom workflows independently.

Step 3. ( optional ) QC

SAMtools: Possible to use SAMtools view. Filtering by flags is possible. (ref)

Step 4. sorting: Sorting BAM files along the alignment axis.

Picard (release, trouble- shooting)


# How to install java on Ubuntu/Debian
sudo apt update
sudo apt install default-jdk

java -jar picard.jar SortSam \
      INPUT=input.bam \
      OUTPUT=sorted_output.bam \
      SORT_ORDER=coordinate


SAMtools


# install in Linux
sudo apt update
sudo apt install samtools

# execution of samtools
samtools sort -o sorted_file.bam input_file.bam


③ name-sorted: Typically, sorting is done based on the read name.

④ coordinate-sorted: For example, the possorted_genome_bam.bam file created in SpaceRanger is position-sorted.

Step 5. marking: Marking duplicates arising from PCR amplification.

Picard


java -jar picard.jar MarkDuplicates INPUT=sorted_file.bam OUTPUT=marked_duplicates.bam METRICS_FILE=metrics.txt


SAMtools


samtools markdup -r -s sorted.bam marked_duplicates.bam


○ -s: single-end read. If paired-end read, remove this part.

Step 6. indexing : .bam.bai file is generated from .bam file.

SAMtools


samtools index marked_duplicates.bam


Step 7. Count: Determining which feature (e.g., gene, exon) a read originated from when there is a read.


image

Figure 5. Counting process


① Algorithms determining whether a read came from gene A or gene B when there is a read.

Type 1. HTSeq

○ A tool based on gene overlap.

Most commonly used counting algorithm.

○ Command example:


# install
pip install HTSeq

htseq-count -f bam -r pos -s no -i gene_id -t exon mouse_sample.bam Mus_musculus.GRCm38.99.gtf > output.count


○ -f bam: Specifies that the input file is a .bam file.

-r pos: Ensures that the feature order in the output file matches the order in the reference.

-s no: -s refers to whether it is strand-specific, in which case it is single-strand seq. For paired-end seq, use -s yes or -s reverse.

-i gene_id: Uses the gene_id attribute from the GTF file as the reference for each feature. transcript_id is also possible.

-t exon: Specifies feature type. Default is exon.

Mus_musculus.GRCm38.99.gtf: Reference name. Can use .gff file instead of .gtf.

> output.txt: Outputs count result to output.txt file.

○ Result example:

○ Simple counting algorithm without assembly process.

Issue 1. When there are multiple isoforms, shared reads are either consistently counted as +1 or discarded, leading to inaccurate counts.

Issue 2: Decreased accuracy when counting at the transcript level compared to the gene level

Gene count > Sum of isoform counts that constitute each gene ( ambiguous reads)

Using genome-aligned data for transcript alignment without separate assembly can be inaccurate

○ Due to high similarity between transcripts, distinguishing by sequencing technology can be extremely challenging

Type 2: RSEM

○ Utilizes the EM (expectation maximization) technique to accurately predict counts

Step 1: RSEM installation


# step 1.
git clone https://github.com/deweylab/RSEM.git

# step 2. 
cd RSEM

# step 3.
make 

# step 4.
make install


Step 2: Prepare reference files


rsem-prepare-reference --gtf annotation.gtf genome.fa reference_name


Step 3: Expression estimation


# case 1. single-end
rsem-calculate-expression --single-end --alignments -p 4 aligned_reads.sam reference_name sample_output

# case 2. paired-end
rsem-calculate-expression --paired-end --alignments -p 4 aligned_reads.bam reference_name sample_output


Type 3: StringTie


# install
conda install -c bioconda stringtie

# count
stringtie sorted_bam.bam -o output.gtf -G reference.gtf -A gene_abundances.tsv


○ genome-based: (Comment) StringTie’s method for transcript-level abundance is not straightforward

Includes assembly process

○ Takes sorted_bam.bam file from genome-aligned reads as input, outputs separate output.gtf file.

○ Uses novel network flow algorithm

○ Example results

Type 4: featureCounts (install)


./subread-2.0.6-Linux-x86_64/bin/featureCounts -a GRCh38.gtf -o counts.txt possorted_genome_bam.bam


○ Example results

Type 5: Cufflinks

Type 6: Hisat2: genome-based

Type 7: eXpress: transcriptome-based

Type 8: bowtie2: transcriptome-based

Type 9: Ballgown (ref): Provides gene count, transcript count, DEG analysis results, etc.

Step 1: Prepare .ctab files

Method 1: TopHat2 + StringTie

Method 2: TopHat2 + Cufflinks + Tablemaker

○ StringTie command example: Generate .ctab files using the -B argument


stringtie -e -B -p 8 -G ref.gtf -l sample -o output.gtf aligned_reads.bam


Step 2: Check directory structure


ballgown/
├── sample1/
   ├── e_data.ctab
   ├── e2t.ctab
   ├── i_data.ctab
   ├── i2t.ctab
   ├── t_data.ctab
   └── output.gtf
├── sample2/
   ├── e_data.ctab
   ├── e2t.ctab
   ├── i_data.ctab
   ├── i2t.ctab
   ├── t_data.ctab
   └── output.gtf
├── sample3/
   ├── e_data.ctab
   ├── e2t.ctab
   ├── i_data.ctab
   ├── i2t.ctab
   ├── t_data.ctab
   └── output.gtf
...


Step 3: Execute Ballgown: Can be run with R


library(ballgown)
bg = ballgown(dataDir = "ballgown", samplePattern = "sample", meas = "all")
gene_expression = gexpr(bg)
transcript_expression = texpr(bg, 'all')


○ Example results

Type 10: Pathseq: Performs filtering, alignment, abundance estimation on mixed metagenomics data of human and microbes using GATK (Genome Analysis Toolkit).

Type 11: Miniasm: an assembly algorithm for long-read sequencing.

Type 12. wgs-assembler

Type 13. Falcon

Type 14. ra-integrate

Type 15: scRNA-seq alignment and count

CellRanger

Type 16: ST alignment and count

SpaceRanger: Limited to Visium. Uses STAR alignment and considers spatial barcoding additionally

SpaceMake: Integrated workflow for various spatial transcriptomics data types

Step 8. (optional) Error correction: It is possible to correct mapping errors during the assembly process.

pbdagcon

falcon_sense

nanocorrect

Step 9. (optional) Variant calling: Can investigate genetic variations like SNPs or indels. Variants are saved as a VCF file.

① Overview: A BED file or VCF file is generated from a BAM file, followed by the creation of a BEDgraph file (a graph data structure file composed of BED), a Wiggle file (a file comparing with the control group), and a bigWig file (a compressed binary version of the Wiggle file).

② GATK (Genome Analysis ToolKit): HaplotypeCaller in GATK is commonly used

③ Freebayes

④ SAMtools

CaVEMan (Cancer Variants Through Expectation Maximization): Used for somatic substitution calling.

Pindel: Used for Indel calling.

BRASS (BReakpoint AnalySiS): Used for structural variant calling.

MACS2: Variant calling pipeline for ChIP-seq

Step 10. (optional) Post-hoc ** **transcript-to-gene conversion: When having count data for transcript IDs, it might be necessary to collapse multiple transcript IDs into one gene ID (e.g., for GO analysis)

① Example: GRCh38 (human) GFF file contains the following information


ID=exon-NR_046018.2-1;Parent=rna-NR_046018.2;Dbxref=GeneID:100287102,GenBank:NR_046018.2,HGNC:HGNC:37102;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2


② Post-conversion methods

Method 1. RefSeq

Method 2. UCSC knowngene

Method 3. Ensembl: manual

Method 4. GENCODE

Method 5. Useful functions available in R (for human, ref)


library(biomaRt)
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")  
ensembl_transcript_to_gene <- function(transcript_ids){
  # reference : https://support.bioconductor.org/p/106253/#106256

  res <- getBM(attributes = c('ensembl_transcript_id_version', 
                              'ensembl_gene_id', 
                              'external_transcript_name',
                              'external_gene_name'),
               filters = 'ensembl_transcript_id_version', 
               values = transcript_ids,
               mart = mart)

  if(dim(res)[1] == 0){
    return("")
  }	

  return(res[, 'external_gene_name'])
}




5. QC 5. Normalization

⑴ Overview

① Definition: Correction for biases where RNA reads do not accurately reflect gene expression due to technical limitations

② In essence, correcting systemic batch effects like library size

Type 1. library size normalization (Depth-based normalization)

① When comparing different samples, divide each sample by a normalization factor to adjust for the total RNA transcripts count

○ In other words, to compare specific gene expression between samples

○ Sequencing depth, a limitation specific to sequencing machines, can be referred to as library size

1-1. RPM (reads per million mapped reads) or CPM (counts per million mapped reads)

○ Divide each gene count by the total count of the sample and multiply by 10^6


image


R code: Maintains equal library size, doesn’t use methods like TMM


library(edgeR)
raw_counts_matrix = ?
group = ?
DEGL <- DGEList(counts=raw_counts_matrix, group=group)

# Keep the raw library size
DEGL_cpm <- calcNormFactors(DEGL, method = "none")

# Calculate the cpm
cpm <- cpm(DEGL_cpm, log = FALSE, normalized.lib.sizes=TRUE)

# Calculate the log cpm
log_cpm <- cpm(DEGL_cpm, log = TRUE, normalized.lib.sizes=TRUE)

#Check out the cpm normalized matrix and log cpm normalized matrix
head(cpm)
head(log_cpm)


1-2. TMM (trimmed mean of M-values) (ref)

○ Definition: Normalize given read counts by dividing by the total read count of the sample + Trimming

○ Proposed by Robinson & Oshlack (2010) (ref)

○ Since trimming is performed, the library size does not remain constant.

Significance 1. Corrects the bias where read count does not accurately represent actual gene activity and is proportional to sequencing depth(ref)

Condition: Ensure genes with the same expression levels in both samples are not detected as DEGs

Assumption: Majority of genes are not differentially expressed (consistent with author’s experience)

Thought experiment

○ Sample A is a mixture of human + mouse RNA, Sample B is specifically human RNA from Sample A, both samples having equal human and mouse RNA counts

○ When depths are equal, human RNA reads in A would be exactly half of B’s human RNA reads: reads in A are distributed over genes twice as much

○ To adjust each gene’s RNA read count in A, multiply by a factor (normalization factor) twice that of B’s

○ This assumption leads to batch + sample effect ≃ batch effect, making the normalization method applicable only in this case

Significance 2. Determining total RNA production Sk is difficult, but calculating the ratio Sk / Sk’ for two samples is relatively easy

○ Brief definition: Divide given read count by the sample’s total count

○ Example


image


○ The following depicts differential gene expression of gene g between Sample 1 and Sample 2


image


○ 1st. Symbol definitions

○ Lg: Length of gene g

○ μgk: Actual transcript count of gene g in sample k. Represents expression level. Related to population

○ Nk: Total transcript count in sample k

○ Sk: RNA count in sample k


image


○ Sk / Nk: Average RNA count per transcript

○ Ygk: Observed transcript count of gene g in sample k. Related to the sample population


image


○ Mg: Gene-wise log-fold-change


image


○ Ag: Absolute expression level


image


○ Sk / Sr: Scaling factor to divide by in sample k. Proportional to Sk. Refer to the thought experiment above.

○ TMM: Normalization factor


image


○ 2nd. Remove genes with expression of 0

○ 3rd. Trimming: The aspect that is most fundamentally different from the RPM or CPM.

○ Trimmed mean: Average of data excluding top x% and bottom x%

○ Doubly trimmed: Trim based on log-fold-change Mrgk and absolute intensity Ag

○ Initial researchers trimmed 30% each way by Mrgk and 5% each way by Ag

○ 4th. Divide by TMM for sample k

○ wrgk: Larger weights for genes with low expression to prevent distortion

○ When Nk = Nk’ and Ygk = 2 × Ygr, TMM is approximately 2


image


R code: Change library size with TMM and then calculate CPM


library(edgeR)
raw_counts_matrix = ?
group = ?
DEGL <- DGEList(counts=raw_counts_matrix, group=group)

# Calculate normalization factors using TMM method to align columns of a count matrix
DEGL_TMM <- calcNormFactors(DEGL, method="TMM")

# Calculate the cpm with the TMM normalized library
TMM <- cpm(DEGL_TMM, log = FALSE, normalized.lib.sizes=TRUE

# Check out the cpm of TMM normalization
head(TMM)


Application 1. GeTMM method

1-3. RLE (relative log estimate)

○ Proposed by Anders & Huber (2010)

○ Normalization method adopted by DESeq, an R package by Love et al. (2014)

Step 1: Calculate geometric mean for all samples to obtain the median library

Step 2: Scale each sample’s median ratio to the median library as scale factor

Step 3: Divide the sample’s count values by the scale factor

R code


library(edgeR)
raw_counts_matrix = ?
group = ?
DEGL <- DGEList(counts=raw_counts_matrix, group=group)

# Calculate normalization factors using RLE method to align columns of a count matrix
DEGL_RLE <- calcNormFactors(DEGL, method="RLE")

# Calculate the cpm with the RLE normalized library
RLE <- cpm(DEGL_RLE, log = FALSE, normalized.lib.sizes=TRUE)

# Check out the TMM normalized result
head(RLE)


1-4. UQ (Upper Quartile) Normalization

○ A method to adjust expression levels at specific quantiles to be the same when analyzing two transcriptomes.

○ Proposed by Bullard et al. (2010).

○ Generally uses the upper 75% (lower 25%) quartile values as variables (cf. Q3-norm).

○ Mostly used in microarray data.

R Code


library(edgeR)
raw_counts_matrix = ?
group = ?
DEGL <- DGEList(counts=raw_counts_matrix, group=group)

# Calculate normalization factors using UQ method to align columns of a count matrix
DEGL_UQ <- calcNormFactors(DEGL, method="upperquartile")

# Calculate the cpm with the UQ normalized library
UQ <- cpm(DEGL_UQ, log = FALSE, normalized.lib.sizes=TRUE)
# Check out the UQ normalized result
head(UQ)


Type 2. Gene Length Normalization

① Definition: A method of correcting for gene length when comparing the expression of different genes (or exons, isoforms) within a single sample.

○ Here, gene length refers to the effective length, which is the actual gene length minus the read length.

○ Gene length correction: Gene counts are proportional to gene length, so count values are divided by gene length.

○ Gene length normalization is criticized for not properly reflecting data characteristics, with decreasing importance of metrics like FPKM, TPM.

2-1. RPKM (Reads Per Kilobase of Transcript per Million Mapped Reads)

○ Formula: RPKM for gene i is represented as Q reads and ℓ exon length.


image


○ Used to compare gene expression among different genes within a sample.

2-2. FPKM (Fragments Per Kilobase of Exon per Million Mapped Fragments)

○ Formula: FPKM for gene i is represented as q reads and ℓ exon length.


image


○ Similar to RPKM, but used only for paired-end RNA-seq (Trapnell et al., 2010).

2-3. TPM (Transcripts Per Kilobase Million): Proposed by Li et al. (2010)

○ Definition: Normalized based on the proportion of each transcript in the library.

○ Relatively easy to use due to length-based normalization, more common than CPM.

○ Summing TPM values in a single library gives 1,000,000, allowing comparison between different samples.


image


○ Relation between TPM and RPKM: For the number of genes n,


image


○ TPM is highly correlated with RPKM (or FPKM) results

Type 3. Log-Transformation

Purpose 1: For genes with high count values, the actual activity of those genes may be exaggerated in interpretation.

Purpose 2: When there are both very small and very large values, it is necessary to adjust the scale to be similar.

③ To address the above issues, the results of log-transforming the normalized counts are considered as gene expression values.

Type 4. scaling

① For each patient and each sample, the range of gene expression values varies. To compare them rationally, the range of values is adjusted.

② For example: In the case of the TCGA, Seurat pipeline, the maximum gene expression value is set to around 10 → the minimum value generally becomes negative.

⑹ R: Obtaining normalized expression using the Seurat package.


library(dplyr)
library(Seurat)

pbmc.data <- Read10X(data.dir = "./Spatial_matrix/")
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 100)
pbmc[["percent.mt"]] <- PercentageFeatureSet(object = pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, subset = nFeature_RNA > 500)
pbmc <- NormalizeData(object = pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
pbmc <- FindVariableFeatures(object = pbmc, selection.method = "vst", nfeatures = 2000) 

### Download for Normalized Count Data
write.csv(pbmc@assays$RNA@data, "./normalized_counts.csv")


all.genes <- rownames(x = pbmc) 
pbmc <- ScaleData(object = pbmc, features = all.genes) 

### Download for Scaled Expression Data (also known as TCGA scale) 
write.csv(pbmc@assays$RNA@scale.data, "./scaled_expression.csv")


⑺ Python: Obtaining normalized expression using the scanpy package (ref)


import scanpy as sc
import pandas as pd

tissue_dir = './'

adata = sc.read_visium(tissue_dir) 
adata.var_names_make_unique()
sc.pp.normalize_total(adata, inplace=True)
sc.pp.log1p(adata)

pd.DataFrame(adata.X).to_csv('normalized_expression.csv')


⑻ Paper Representation

○ The count data were normalized by log2-transformation of counts per million (CPM) + 1 pseudocount, and scatter plots were generated for each pair of consecutive section using the ggplot2 package (28) in R Studio (versioin 1.1.453). The built-in stats package was used to compute Pearson correlations. (ref)



6. QC 6. Batch Effect

⑴ Overview

① Background: Apart from experimental variables, batch effects can influence results between experimental and control groups.

② Batch Effect: Experimental outcomes affected by factors other than biological variables. Examples include:

○ Sequencing dates

○ Researchers performing sequencing

○ Sequencing equipment

○ Protocols

③ Systemic batch effects like library size can be corrected through normalization.

④ Batch effect removal process referred to as removing non-biological batch effects, often using regression analysis.

⑵ Batch Effect Removal

① This is particularly important in DEG analysis through comparison between samples.

② Caution


    batch condition
    <factor> <factor>
1          1        A 
2          1        A
3          1        B
4          1        B
5          2        C
6          2        C


○ Batch effects cannot be theoretically removed if a specific condition applies to a specific batch.

○ However, even in this case, it is possible to create an incomplete regression model by distinguishing between batch effect and covariate to remove the batch effect.

Method 1: limma::removeBatchEffect: Removes batch effects by inputting normalized expression matrix and batch information into a linear model. (ref)

Method 2: sva::ComBat (Johnson et al., 2007): Empirical Bayes-based method for removing batch effects when batch information is known. (ref, ref, ref)


# Import the sva package
library(sva)

# Create batch vector
batch <- c(1,2,3,1,2,3)

# Apply parametric empirical Bayes frameworks adjustment to remove the batch effects
combat_edate_par = ComBat(dat=TMM, batch=batch, mod=NULL, par.prior=TRUE, prior.plots=TRUE)

# Apply non-parametric empirical Bayes frameworks adjustment to remove the batch effects
combat_edata_non_par = ComBat(dat= TMM, batch=batch, mod=NULL, par.prior=FALSE, mean.only=TRUE)

# Check out the adjusted expression profiles
head(combat_edate_par)
head(combat_edate_non_par)


Method 3: ComBat_seq (Zhang et al., 2020): Takes raw count matrix and outputs adjusted count matrix. Batch information should be known.


library(sva)

# Create batch vector
batch <- c(1,2,3,1,2,3)

# Include group condition
combat_seq_with_group <- ComBat_seq(raw_counts_matrix, batch=batch, group=group, full_mod=TRUE)
# Without group condition
combat_seq_without_group <- ComBat_seq(raw_counts_matrix, batch=batch, group=NULL, full_mod=FALSE)

# Check out the adjusted expression profiles
head(combat_seq_with_group)
head(combat_seq_without_group)


Method 4: SVA seq (Leek, 2014): Functions even without known batch information.

Method 5: RUV seq (Risso et al., 2014)

⑶ Measurement of the accuracy of batch effect removal

① k-nearest-neighbor batch effect test (kBET)

② ASW across batches

③ k-nearest-neighbor (kNN) graph connectivity

④ batch removal using PCA regression

⑷ Application 1: Clustering and Batch Effect

① Clustering can be used for batch effect removal by dividing experimental and control groups within major clusters (e.g., cell type clusters) to obtain DEGs.

⑸ Application 2: Data Integration

① Merging data from different batches or modalities, with consideration of batch correction effects.

② Modifying two groups to have the same characteristics.

Example 1: Train the model so that the clustering patterns are not affected by batch effects.

Example 2: Train the model to minimize the discriminator’s performance, which distinguishes data based on batch-specific effects (i.e., to make the data indistinguishable after integration).


image

Figure 6. Types of Data Integration


Method 1. Canonical Correlation Analysis (CCA: Slightly overfits to align two groups. Used in R Seurat pipeline.

Method 2. RPCA (reciprocal principal component analysis): Aligns two groups with less overfitting, suitable for cases with significantly different tissue characteristics. Used in R Seurat pipeline.

Method 3. MNN (Mutual Nearest Neighbors) (git): Utilized in R scater or Python scanpy.

Method 4. BBKNN (Batch-Balanced k-Nearest Neighbor): Using PCA

Method 5. Scanorama: Used in Python scanorama. Using singular value decomposition

Method 6. Conos : Used in R conos.

Method 7. scArches: Utilizes transfer learning.

Method 8. DESC

Method 9. fastMNN (batchelor): Using PCA

Method 10. Harmony: Using PCA. Harmony does not provide batch effect-corrected normalized expression, but it does provide batch effect-corrected low-dimensional embeddings that can be used for downstream analysis.

Method 11. LIGER: A general approach for integrating single-cell transcriptomic, epigenomic and spatial transcriptomic data. factor-based. Using integrative non-negative matrix factorization

Method 12. SAUCIE

Method 13. scANVI: conditional VAE (ref)

Method 14. scGen: conditional VAE (ref)

Method 15. scVI: conditional VAE (ref)

○ Provides normalized expressions that remove batch effects and provides code to calculate DEGs based on them (ref).

○ The provided code for calculating DEGs (Differentially Expressed Genes) is not only slow but also inaccurate: removing batch effects from 10 datasets might slightly differ from the removal in just 2 of those datasets, which can sometimes lead to completely opposite DEG patterns.

○ It is recommended to use scVI normalized expression and scanpy.tl.rank_genes_groups to identify DEGs.

Method 16. TrVae: conditional VAE (ref)

Method 17. TrVaep

Method 18. scib: In addition to merging multiple integration tools, benchmarking is also provided.

○ Integration function: BBKNN, ComBat, DESC, Harmony, MNN, SAUCIE, Scanorama, scANVI, scGen, scVI, trVAE

○ Benchmarking metric: ARI, ASW, F1, mutual score, etc.

Method 19. iMAP

Method 20. INSCT

Method 21. scDML

Method 22. scDREAMER, scDREAMER-Sup: It even enables inter-species integration.

Method 23. SATURN: Integration of datasets from different species

Method 24. GLUE (Cao and Gao, 2023): A tool for integrating unpaired data from different platforms such as scRNA-seq, scATAC-seq, and snmC-seq.

Step 1: Embed cell × feature data from each omics into cell × embedding, where each cell is represented in a common dimension.

Step 2: To represent the relationships between omics, use a knowledge-based guidance graph to represent regulome × embedding, where each regulome is expressed as a vector in the same dimension as in Step 1.

Step 3: Construct an autoencoder by feeding the inputs from Step 1 and Step 2 into the encoder, with the decoder outputting the integrated data.

Step 4: Adversarial learning: Introduce a discriminator to identify platform-specific effects, and search for autoencoder parameters that minimize the discriminator’s performance (thus facilitating better integration).

Method 25. Seurat Anchor (Stuart et al., 2019): factor-based

Method 26. DC3 (Zeng et al., 2019): factor-based

Method 27. coupled NMF (Duren et al., 2018): factor-based

Method 28. SCOT (Demetci et al., 2022): topology-based

Method 29. UnionCom (Cao et al., 2020): topology-based

Method 30. Panoma (Cao et al., 2022): topology-based

Method 31. MrVI

2-1. Merge between scRNA-seq or ST

① Except for clustering, most methods require switching DefaultAssay from integrated to RNA/Spatial/SCT, etc

② FindMarkers

③ FindAllMarkers

④ FeaturePlot

⑤ SpatialFeaturePlot

⑥ DotPlot

⑦ DimPlot

⑧ SpatialDimPlot

⑨ genewise correlation

⑩ trajectory analysis

2-2. Merging scRNA-seq data with other modalities


image

Figure 7. Merging scRNA-seq data


스크린샷 2024-09-05 오전 10 22 40

Table 2. Analysis methods for matched data


스크린샷 2024-09-05 오전 10 23 15

Table 3. Analysis methods for unmatched data



7. Common 1. Clustering

⑴ Type 1: K Means Clustering

⑵ Type 2: Unsupervised Hierarchical Clustering

⑶ Type 3: Matrix factorization

① Algorithm to factorize a known matrix A into W and H matrices: A ~ W × H

○ A matrix: Represents samples and features. Can be inferred from samples.

○ H matrix: Represents variables and features.

○ Similar to K means clustering, PCA algorithm.

○ An autoencoder is a broader concept than matrix factorization because it includes non-linear transformations.

○ The following algorithms are based on the least square method, but can also utilize methods such as gradient descent.

② Algorithm: R = UV, where R ∈ ℝ5×4, U ∈ ℝ5×2 , and V ∈ ℝ2×4, searches for U and V.


import numpy as np
R = np.outer([3,1,4,2.,3],[1,1,0,1]) + np.outer([1,3,2,1,2],[1,1,4,1])
U=np.random.rand(5,2)
V=np.random.rand(2,4)
for i in range(3):
    V,_,_,_=np.linalg.lstsq(U, R, rcond=None)
    Ut,_,_,_=np.linalg.lstsq(V.T, R.T, rcond=None)
    U=Ut.T
U@V # it is similar to R!


③ NMF(non-negative matrix factorization)


import numpy as np
R = np.outer([3,1,4,2.,3],[1,1,0,1]) + np.outer([1,3,2,1,2],[1,1,4,1])
U=np.random.rand(5,2)
V=np.random.rand(2,4)
for i in range(20):
    V,_,_,_=np.linalg.lstsq(U, R, rcond=None)
    V = np.where(V >= 0, V, 0) #set negative entries equal zero
    Ut,_,_,_=np.linalg.lstsq(V.T, R.T, rcond=None)
    U=Ut.T
    U = np.where(U >= 0, U, 0) #set negative entries equal zero
U@V # it is similar to R!


④ Matrix completion (Netflix algorithm): performing matrix factorization on a masked R.


import numpy as np
R = np.outer([3,1,4,2.,3],[1,1,0,1]) + np.outer([1,3,2,1,2],[1,1,4,1])
mask=np.array([[1.,1,1,1],
               [1,0,0,0],
               [1,0,0,0],
               [1,0,0,0],
               [1,0,0,0]])
               
U=np.random.rand(5,2)
V=np.random.rand(2,4)
RR = U@V
RR = RR*(1-mask)+R*mask

for i in range(20):
    V,_,_,_=np.linalg.lstsq(U, R, rcond=None)
    V = np.where(V >= 0, V, 0)
    Ut,_,_,_=np.linalg.lstsq(V.T, R.T, rcond=None)
    U=Ut.T
    U = np.where(U >= 0, U, 0)
    RR = U@V
    RR = RR*(1-mask)+R*mask

RR*(1-mask)+R*mask # it is similar to R!


Application 1. Cell type classification

○ Aims to obtain cell type proportions from scRNA-seq data from tissues.

○ Important to reduce confounding effects due to cell type heterogeneity.

3-1. Constrained linear regression

3-2. Reference-based approach

3-2-1. CIBERSORT (cell-type identification by estimating relative subsets of RNA transcript): Allows you to check the cell type proportion and p-value for each sample.

Application 2. joint NMF: Expands to multi-omics.

Application 3: Metagene extraction

Application 4. Starfysh: The following algorithm infers archetypes from spatial transcriptomic data and determines anchors representing each archetype:

Step 1. Construct an autoencoder


스크린샷 2024-10-07 오후 6 50 41


○ X ∈ ℝS×G: Input data (spots × genes)

○ D: Number of archetypes

○ B ∈ ℝD×S: Encoder. In the context of inferring archetypes, the sum of the distribution of each archetype across all spots must be 1.

○ H = BX: Latent variable

○ W ∈ ℝS×D: Decoder. In the context of reconstructing the input data, the sum of the weights of all archetypes for each spot must be 1.

○ Y = WBX: Reconstructed input

Step 2. Solve the optimization algorithm to calculate W and B


스크린샷 2024-10-07 오후 6 51 51


Step 3. Spots with the highest weights for each archetype in the W matrix are selected as anchor spots

Step 4. Adjust granularity: If the distance between archetypes is close, merge them or use a hierarchical structure to adjust the distance

Step 5. Form archetypal communities by searching for the nearest spot for each anchor and identifying marker genes

Step 6. If a signature gene set is given, add archetypal marker genes to the existing gene set and recalculate the anchors.

○ During this, use the stable marriage matching algorithm to match each archetype with the most similar signature.

Category 4. Other clustering algorithms

① SNN (shared nearest neighbor) modularity optimization based clustering algorithm

② Leiden clustering

③ Louvain clustering

④ mean-shift clustering

⑤ DBSCAN (density-based spatial clustering of applications with noise)

⑥ spectral clustering

⑦ gaussian mixture

watershed algorithm

⑨ thresholding method

⑩ MST (minimum spanning tree)

⑪ curve evolution

⑫ sparse neighboring graph

⑬ SC3

⑭ SIMLR

⑮ FICT

⑯ fuzzy clustering



8. Common 2. Differentially Expressed Gene (DEG) Analysis

⑴ Definition: Process of finding genes that differ between experimental and control groups.

⑵ Criteria for DEG: Slightly varies depending on experimental design.

① FC (fold change)

○ Definition: Ratio of treatment’s average gene expression to control’s average gene expression.

Issue 1. Division by zero: Genes with such issues are either removed beforehand or assigned special values through separate interpretations.

Issue 2. Asymmetrical values around 1: Addressed by using log fold change.

Adjusted p-value

○ Multiple Testing Problem: The act of conducting multiple statistical tests itself can lead to inaccurate conclusions.

Type 1. Control of Family-Wise Error Rate (FWER)

○ Definition: The probability of reaching at least one incorrect conclusion among all hypotheses tested. For instance, a FWER of 5% means there is a less than 5% chance of making even one incorrect conclusion across multiple hypothesis tests. This approach is very conservative, allowing almost no false positives.

1-1. Sidak Correction: Adjusts the alpha threshold rather than the p-values and is used when p-values are independent.

1-2. Bonferroni Correction: Directly adjusts each p-value and can be applied even when p-values are not independent. It is highly conservative.

Type 2. Control of False Discovery Rate (FDR)

○ Definition: A method that limits the probability of making incorrect conclusions (false discoveries) among hypotheses where no true difference exists, keeping the FDR below a certain level.

2-1. Benjamini–Hochberg (B&H): Used when the correlations between tests are relatively simple.

2-2. Benjamini–Yekutieli (B&Y): Used when the correlations between tests are complex.

Visualization 1. Volcano plot

○ x-axis: log fold change

○ y-axis: log adjusted p-value

○ Useful for visualizing the distribution of genes satisfying the DEG condition.

Visualization 2. MA plot

○ x-axis: Mean of log normalized count

○ y-axis: log fold change

Widely used in microarray analysis and applicable to RNA-seq.

⑶ Statistical Techniques

① t test

○ One of the commonly used statistical techniques

○ Parametric statistical estimation

○ Applies t-test after taking the log of expression (FPKM/TPM)

○ Not recommended for small sample sizes due to difficult variance estimation: In this case, DESeq, EdgeR, limma are recommended

○ Strict application of FC threshold is recommended

Mann-Whitney-Wilcoxon (Wilcoxon rank-sum test)

○ Non-parametric statistical estimation

○ Not recommended for sample sizes less than 10

○ For small samples, it is recommended to use limma, edgeR, DESeq2, etc.

ANOVA & Kruskal-Wallis test

○ Used for multi-level single factor DEG analysis

○ For sample sizes less than 10, it’s recommended to use DESeq2, EdgeR, limma’s multi-level single factor mode

⑷ DEG Tools

① DESeq

DESeq2 (paper, manual)

○ Input: raw count

○ Based on the negative binomial model: Useful when transcriptome has many zero counts and doesn’t follow a normal distribution

○ Note the different preprocessing methods for obtaining DEGs and visualization

③ edgeR

○ edgeR (exact)

○ edgeR (GLM)

④ limma

○ limma + voom (paper, manual)

○ Input: raw count (avoid using normalized data like FPKM with voom)

○ GLM-based

○ 1st. Removing the correlation between mean and variance present in count data using voom

○ 2nd. DEG exploration based on linear model

○ The matrix used for obtaining DEGs is also used for visualization

○ limma FPKM (paper, manual)

○ Input: FPKM (useful when raw count data is not available)

○ 1st. Convert FPKM to log2 scale

○ 2nd. Run limma’s eBayes() function with the trend = TRUE option

○ This method is similar to what’s used in microarray analysis and is akin to the limma-trend method

⑤ cuffdiff

⑥ BaySeq

⑦ DEGSeq

⑧ NOISeq

⑨ PoissonSeq

⑩ SAMSeq

scVI

○ Provides normalized expressions that remove batch effects and provides code to calculate DEGs based on them (ref).

○ The provided code for calculating DEGs (Differentially Expressed Genes) is not only slow but also inaccurate: removing batch effects from 10 datasets might slightly differ from the removal in just 2 of those datasets, which can sometimes lead to completely opposite DEG patterns.

○ It is recommended to use scVI normalized expression and scanpy.tl.rank_genes_groups to identify DEGs.



9. Common 3. Gene Set Analysis

⑴ Overview

① Definition: Analyzing gene sets as a whole, rather than individual genes.

② Terminology

Gene Score: Refers to the value generated by a list of genes.

○ Signature: A dataframe composed of gene names, FC (fold change) values, or p-values. Gene score can also be considered a signature.

③ Principle: Create a contingency table and confirm significance using Fisher’s exact test (hypergeometric test).


image

Table 4. Contingency table in gene set analysis


Statistic 1. Probability: Probability of being like the sample.

Statistic 2. Odds ratio: Shows if GO and Gene Set are similar or dissimilar.


image


Statistic 3. Gene ratio: Represents A / (A+B), i.e., the ratio of common genes to the input gene set.

Statistic 4. Count: Usually represents A, the number of intersection elements between two sets.

④ Interpretation

○ Rank genes based on expression between experimental and control groups, then calculate enrichment scores.

○ If genes from the query gene set are highly concentrated at both extremes (e.g., top 10 or bottom 10) of the ranked gene list, the enrichment score increases.

Type 1. GSEA (gene set enrichment analysis) (manual)

① Input

Type 1. expression file (.gct or .txt): Expression values

Type 2. phenotype label (.cls): Group names (e.g., control, normal)

Type 3. gene annotation (.chip): Gene names

Type 4. gene set data: Gene sets

Category 1. M1H: Mouse-ortholog hallmark gene sets, etc.

Category 2. M1: Positional gene sets, etc.

Category 3. M2: Curated gene sets, etc.

Category 4. M3: Regulatory target sets, etc.

Category 5. M5: Ontology gene sets, etc.

Category 6. M8: Cell type signature gene sets, etc.

② Output

○ FDR (false discovery rate)

○ NES (normalized enrichment score)

○ leading edge subset

Type 2. GO (gene ontology)

① Major bioinformatics initiative to integrate the expression of genes and gene products across all species

② Grouped according to CC (cellular component), MF (molecular function), BP (biological process), etc.

③ Implementation: Functions implemented through R


library(EnhancedVolcano)
library(clusterProfiler)
library(org.Hs.eg.db)
library(org.Mm.eg.db)
library(enrichplot)

GO.plot <- function(gene){
    # ont = "ALL", "BP", "CC", "MF"
    # showCategory is not mandatory

    gene <- gsub("GRCh38", "", gene) # human 데이터 가공시의 reference 이름 제거
    gene <- gsub("mm10", "", gene) # mouse 데이터 가공시의 reference 이름 제거
    for(i in 1:10){
  	  gene <- gsub("-", "", gene) # 불필요한 앞부분의 - 제거
    }
    gene <- gsub('\\ .*$', '', gene) # 'KLK2 ENSG00000167751' 같은 것을 해결 
    
    if (gene[1] == toupper(gene[1])){ ## Human gene
        gene.df <- bitr(gene, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = org.Hs.eg.db)
        gene.df <- as.vector(gene.df[[2]])
        GO <- enrichGO(gene.df, OrgDb = 'org.Hs.eg.db',keyType = "ENTREZID", ont = "ALL", pvalueCutoff = 0.05, pAdjustMethod = "BH")
        dotplot(GO,split="ONTOLOGY", showCategory = 5)+facet_grid(ONTOLOGY~., scale="free")
	} else{ ## Mouse gene?
        gene.df <- bitr(gene, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = org.Mm.eg.db)
        gene.df <- as.vector(gene.df[[2]])
        GO <- enrichGO(gene.df, OrgDb = 'org.Mm.eg.db',keyType = "ENTREZID", ont = "ALL", pvalueCutoff = 0.05, pAdjustMethod = "BH")
        dotplot(GO,split="ONTOLOGY", showCategory = 5)+facet_grid(ONTOLOGY~., scale="free")
    }
}

### Example
GO.plot(c("COL1A1", "COL1A2", "COL3A1", "COL6A3"))


④ Example results


image

Figure 8. Example results of GO analysis


Count: Size of the intersection between input gene set and each GO term, i.e., the number of common genes

p.adjust: p value adjusted using Fisher’s exact test and Benjamini-Hochberg (B&H) adjustment between input gene set and each GO term

GeneRatio: Gene ratio, i.e., proportion of common genes in the input gene set (ref)

How to interpret a GO plot

Type 3. DAVID (functional annotation bioinformatics microarray analysis)

① Provides biological interpretation of submitted genes

Type 4. MSigDB (molecular signature database)

① H: Hallmark gene sets (50 terms)

② C1: Positional gene sets (299 terms)

③ C2: Curated gene sets (6226 terms)

④ C3: Regulatory target gene sets (3556 terms)

⑤ C4: Computational gene sets (858 terms)

⑥ C5: Ontology gene sets (14765 terms)

⑦ C6: Oncogenic signature gene sets (189 terms)

⑧ C7: Immunologic signature gene sets (4872 terms)

⑨ C8: Cell type signature gene sets (302 terms)

Type 5. EnrichR

① Performs gene set analysis using various gene set information and provides an online-based API

② Statistical significance calculation using Fisher’s exact test and Benjamini-Hochberg (B&H) adjustment

③ Shows concordance between submitted genes and other annotated gene sets

Type 6. ToppGene

① Transcriptome

② Proteome

③ Regulome (e.g., TFBS, miRNA)

④ Ontologies (e.g., GO, Pathway)

⑤ Phenotype (e.g., human disease, mouse phenotype)

⑥ Pharmacome (e.g., Drug-Gene associations)

⑦ Literature co-citation

Type 7. iLINCS: Drug database

Type 8. g:Profiler

Type 9. KEGG (kyoto encyclopedia of genes and genomes)

① Biochemical pathway database created in Japan (1995)

② One of the most cited databases in the field of biology

③ Includes various information related to systems, genes, health, chemistry, etc., centered around biochemical pathways

④ Web-based tool: g:Profiler

Type 10. Metabolic Pathways

Type 11. REAC (reactome), WP, TF, etc.

① Shows pathways related to submitted genes

② Web-based tools: g:Profiler, iLINCS

Type 12. IPA (ingenuity pathway analysis)

① Commercial software

② Visualizes pathways and networks through omics data

③ Allows identification of causal mechanisms and key factors using data

④ Includes information like canonical pathways, upstream regulator analysis, and updated information

Type 13. SPIA (signaling pathway impact analysis)

① Shows signaling pathway topology related to submitted genes

⒂ Others

① WP: Can be checked using web-based tools like g:Profiler

② TF: Can be checked using web-based tools like g:Profiler

③ GSVA

④ AUCell package

⑤ HOMER motif analysis

⑥ Gold standard method



10. Common 4. Gene Interaction Analysis

Type 1. Ligand-Receptor Database

① Principle: Interaction between ligand and receptor in two cells when one cell has high ligand expression and the other has high receptor expression

1-1. NicheNet

1-2. CellphoneDB (tutorial)

1-3. CellChat

1-4. CellTalkDB

Type 2. Network Database

2-1. Biological network construction

Type 1. Gene regulatory network

Type 2. Protein-protein interaction network

Type 3. Co-expression network

○ Networks can be constructed based on biological interactions or gene expression data

○ Biological significance of genes can be uncovered using various network measures/metrics after network construction

Example 1: ToppNet

Example 2: ToppGenet

Example 3: GeneMANIA

Type 3. Hub Gene Detection

① Types of Networks

○ Degree Centrality

○ Betweenness Centrality

○ Closeness Centrality

○ Eigenvector Centrality

○ Participation Coefficient

○ Pagerank

② Extracting Core Genes within Modules/Communities through Various Metrics for Finding Hubs in Network Analysis Techniques

Type 4. PiNET

① Annotate, map, and analyze peptide moieties


11. Common 5. Cell Type Mapping Analysis

General cell type annotation

① Purpose

Significance 1. Cell type analysis in scRNA-seq and in-depth analysis of ROI in ST can effectively mitigate batch effects caused by sample selection bias.

Significance 2. Analyzing based on cell types rather than gene expression analysis makes the results easier to understand.

Method 1. Clustering-based

○ Definition: Clustering scRNA-seq data and then labeling cell types based on the differential expression of genes within each cluster.

Drawback 1. Divided by resolution parameter: Not all cells in the same cluster necessarily belong to the same cell type.

Drawback 2. Cells of the same type might be split into different clusters depending on their states.

Drawback 3. Comparing cell type labels across different platforms is challenging (e.g., immune cell vs DC vs cDC)

Method 2. Meta-tagging

○ Definition: Attaching multiple cell types to a cell based on scores or multiple rounds of clustering.

○ Example: Labeling a single cell as immune cell, DC, and cDC simultaneously.

Method 3. Foundation Model Approach - Using language models to create a single labeler for specifying cell types.

GeneFormer: Based on BERT. Uses a transformer encoder-based architecture. Utilized with a pretraining → finetuning approach. Zero-shot capabilities are practically useless.

scGPT: Based on GPT. Uses a transformer decoder-based architecture. Utilized with a pretraining → finetuning approach. The zero-shot performance of the pretraining model is also quite excellent.

GenePT and Simple use of GPT-4

UCE (Universal Cell Embeddings)

scfoundation

CELLama

⑵ Bulk RNA-seq

xCell: Based on R. Defines weights for each human gene.

immunedeconv: A benchmarking algorithm based on R.

○ For human data: Modes available include quantiseq, timer, cibersort, cibersort_abs, mcp_counter, xcell, epic, abis, consensus_tme, and estimate.

○ For mouse data: Modes available include mmcp_counter, seqimmucc, dcq, and base.

⑶ scRNA-seq (Single-cell RNA sequencing)

Seurat: Based on R. Employs a method called ‘Label Transfer.’

② scanpy: Python-based, especially related to cell type analysis through ingest.

Scanorama: Python-based.

sc-type: Based on R. Utilizes pre-defined gene sets for each cell type.

celltypist and celltypist2: Python-based. Saving pre-defined references in a .pkl file.

scTab

⑷ ST (Spatial Transcriptomics)

CellDART and spSeudoMap

RCTD: Based on R.

③ MIA analysis: Includes both enrichment and depletion analyses.

Cell2location

⑤ SPOTlight

⑥ DSTG

CellTrek: After co-embedding scRNA-seq and ST, it performs cell type labeling using a distance-based graph and random forest.

⑧ TIMER

⑨ CIBERSORT

⑩ MCP-counter

⑪ Quantiseq

⑫ EPIC

⑬ CytoSpace

⑭ Tangram

⑮ BayesPrism

⑯ DestVI

⑰ Stereoscope



12. Advanced 1. Alternative Splicing Analysis (AS Analysis)

⑴ Overview

① AS analysis can be conducted using existing data with splice-aware aligners.

② However, with the emergence of long-read sequencing, chosen as the technology of the year in 2022, more accurate analysis becomes possible.

Long-Read Sequencing

① Fewer sequencing gaps compared to short-read sequencing.


image

Figure 9. Long-Read Sequencing vs. Short-Read Sequencing


Advantage 1. AS Analysis: Enables identification of alternative splicing events and isoforms.

Advantage 2. Facilitates the integration of epigenetics and transcriptomics.

Example 1. Pacific Biosciences SMRT (Single Molecule Real-Time) Sequencing: Average read length is ~20 kb.

Example 2. Oxford Nanopore Sequencing: Average read length is ~100 kb.

⑶ (Reference) Alternative Splicing Event

SE (Skipped Exon): Entire specific exon is either included or excluded.

A5SS (Alternative 5’ or 3’ Splice Site): Different usage of 5’ or 3’ splice junctions of an exon, not the whole exon.

MXE (Mutually Exclusive Exon): Exclusively spliced exon when another exon is spliced and vice versa.

RI (Retained Intron): Intron not coding for amino acids is retained or spliced.

⑷ (Reference) Exon Sequencing: Often contrasted with gene sequencing.

① Exon symbols examples:

○ chr15:63553600-63553679:-

○ chr15:56967876-56968046:-

○ chr7:7601136-7601288:+

○ chr11:220452-220552:-

② Gene symbol examples:

○ Human Genes: SIRPA, HBB-BS, etc.

○ Mouse Genes: Sirpa, Hbb-bs, etc.

Type 1. Event-Based AS Quantification: Quantification by dividing into exons using a count-based model.

① PSI Value: Quantifies PSI for each exon event.

○ Quantifies AS events using Percent-Splice-In (PSI) value.

○ PSI value = inclusion reads / (inclusion reads + exclusion reads)

○ Representative Tool: rMATs

② Exon Usage: Analyzed based on exon counts.

○ Reads can be classified into exon reads and junction reads based on mapped regions.

○ Exon reads: Mapped reads within exon regions.

○ Junction reads: Mapped reads at splice junctions.

○ Exon reads are counted per exon for exon usage (exon-level expression) calculation.

○ Representative Tool: DEXseq

Type 2. Isoform-Based AS Quantification

① Definition: Estimate expression of each isoform transcript at the transcript level using statistical models.

② Purpose: Improved efficacy in constraints, diagnostics, etc., if target isoforms can be defined.

③ Representative Tool: RSEM

④ Databases Used for Isoform Exploration

UniProt: Most well-known protein-related database.

Ensembl

Reactome

GPP Web Portal

NCBI Genome Data Viewer

○ NCBI assembly (example): Execute code targeting FASTA and GTF files (e.g., pyfaidx)



13. Advanced 2. Trajectory Analysis

⑴ Overview

① CNV (copy number variation): Abnormal number of chromosomes caused by cell division abnormalities. Refers to chromosome deletion or aneuploidy.

② SNP (single-nucleotide polymorphism): A difference in specific nucleotide sequences.

③ This is only possible with direct sequencing methods (e.g., Visium FF, scRNA-seq), not with probe-based methods (e.g., Visium FFPE).

④ In CNV analysis, p refers to the short arm, and q refers to the long arm.

⑵ CNV Analysis Algorithms

① CopywriteR: WGS-based. Analyzes off-target read depth.

② CNVkit: WGS-based. Analyzes deviations in read depth.

③ ASCAT: WGS-based

④ Ginkgo: scDNA-seq-based. Analyzes deviations in read depth.

⑤ InferCNV: scRNA-seq-based.

⑥ CopyKat: scRNA-seq-based.

⑦ CONICSmat: scRNA-seq-based.

⑧ HoneyBADGER: scRNA-seq-based.

⑨ CaSpER: scRNA-seq-based.

⑩ Numbat: scRNA-seq-based.

⑪ SpatialInferCNV: ST-based.

⑫ STARCH: ST-based.

⑬ CalicoST: ST-based.


image

Table 5. Summary of CNV Analysis Algorithms


⑶ SNP Analysis Algorithms

① SCmut: scRNA-seq-based.

② scSNV: scRNA-seq-based.

③ SComatic: scRNA-seq-based.

④ STmut: ST-based.

⑷ Trajectory analysis pipeline

① gene expression along pseudotime: Monocle (ref1, ref2, ref3), TSCAN (ref), Slingshot (ref)

② cell abundance along pseudotime: milo (ref), DAseq (ref)

③ trajectory lineage: tradeSeq (ref), LinRace (ref), SCORPIUS (for scRNA-seq) (ref)

④ Trajectory analysis in samples from different batches: Phenopath (ref), Condiments (ref), Lamian (ref)

⑤ RNA velocity (spliced vs unspliced): Velocyte, scVelo, STARsolo, dynamo, MultiVelo (ref), SPATA (ST-based)

⑸ Phasing

① Definition: The process of mapping each RNA transcript to either the paternal or maternal chromosome, or at least to distinguishable haplotypes.

② Primarily studied in F1 hybrid mouse models: RNA transcripts are mapped to each haplotype individually or to a combined reference based on SNPs.

③ Types of RNA-seq phasing: Lapels and Suspender pipeline, Eagle2, SHAPEIT, WhatsHap.

④ Types of Hi-C phasing: HARP, HaploHiC, ASHIC, HiCHap (Ohm).



14. Advanced 3. Epigenomics Analysis

Type 1. Gene Function Identification

① Perturb-seq: Treating Cas9 expressing cells with different gRNA libraries, followed by simultaneous sequencing of gRNA and mRNA.

Type 2. Transcription Regulation Identification

① BS-seq (Bisulfite Sequencing): Identifying methylation patterns.

② ChIP-seq (Chromatin Immunoprecipitation Sequencing): Identifying binding sites of transcription factors.

③ Hi-C seq (High Throughput Chromatin Conformation Capture Sequencing): 3D folding structure information of nuclear chromatin.

④ DNA Ticker Tape (Prime Editing)

⑤ ENGRAM (Enhancer-Driven Genomic Recording of Transcriptional Activity in Multiplex)

⑥ Spatial-CUT&Tag

⑦ Multi-CUT&Tag

⑧ MuLTI-Tag

⑨ ATAC-seq

⑩ NOMe-seq (Nucleosome Occupancy and Methylome Sequencing)

⑪ DNaseI-seq

⑫ MNase-seq

⑬ FAIRE-seq

⑭ MBD-seq

⑮ ChIA-PET

Type 3. Post-Translational Regulation Identification

① scRibo-seq

② STAMP-RBP

Type 4. Programmable Cell Function

① RADARS

LADL (light-activated dynamic looping): photo-activatable gene expression



15. Advanced 4. Special Transcriptomics Analysis

⑴ Spatial Transcription Factor Analysis (ref1, ref2)

① Barcode-based (spot-based) transcription factors: 10x Visium, etc.

② Image-based (FISH-based) transcription factors: 10x Xenium, Vizgen MERSCOPE, Nanostring CosMx, etc.

○ Patent dispute between Nanostring and 10x Genomics (‘23) (ref1, ref2) → Nanostring bankruptcy (ref) and acquisition (ref)

③ ST analysis downstream pipeline

Comprehensive Pipeline: Seurat, Squidpy, Scanpy, Giotto, SpatialData

○ Identifying spatial domains: SpaCell, STAGATE, GraphST, stLearn, RESEPT, Spatial-MGCN, SpaGCN, ECNN, MGCN, SEDR, JSTA, STGNNks, conST, CCST, BayesSpace, DRSC, Giotto-H, Giotto-HM, Giotto-KM, Giotto-LD, Seurat-LV, Seurat-LVM, Seurat-SLM


Algorithm Spatial Information Required Histology Information Required Cluster Number Required Program Language
BayesSpace Yes No Yes R
DRSC optional No optional R
Giotto-H No No Yes R
Giotto-HM Yes No Yes R
Giotto-KM No No Yes R
Giotto-LD No No No R
Seurat-LV No No No R
Seurat-LVM No No No R
Seurat-SLM No No No R
SpaCell No Yes Yes Python
SpaCell-G No No Yes Python
SpaCell-I No Yes Yes Python
SpaGCN Yes No optional Python
SpaGCN+ Yes Yes optional Python
stLearn Yes Yes No Python

Table 6. Spatial Clustering Methods


○ Identifying spatially variable genes (SVGs): SPARK, SPARK-X, SpatialDE, SpaGCN, ST-Net, STAGATE, HisToGene, CoSTA, CNNTL, SPADE, DeepSpaCE, conST, MGCN, STGNNks, SpatialScope, nnSVG, Hotspot, MERINGUE

○ SpatialDE : Gaussian process regression

○ SPARK : generalized Poisson regression

○ SPARK-X : non-parametric method

○ MERINGUE : spatial autocorrelation measure

○ nnSVG : nearest-neighbor Gaussian process

○ Enhancing gene expression resolution (GER): XFuse, DeepSpaCE, HisToGene, SuperST, TESLA, iStar

○ Gene imputation: LIGER, SpaGE, stPlus, Seurat, Tangram, gimvI, GeneDART

○ Cell-cell interaction (CCI): Giotto, MISTy, stLearn, GCNG, conST, COMMOT, NCEM, spatial variance component analysis, Tangram

○ Cell type deconvolution: MIA, Stereoscope, RCTD, cell2location, DestVI, STdeconvolve, SPOTlight, SpatialDWLS, GIST, GraphST, DSTG, Tangram, CellDART, Tacco, Smoother, CARD, Celloscope, Starfysh

○ Cell segmentation: watershed, CellPose, JSTA, Baysor, GeneSegNet

○ Image alignment: DiPY, bUnwarpJ (ImageJ), STalign

○ CNV inference: SpatialInferCNV, SPATA


image

Figure 10. ST analysis downstream pipeline


④ 3D spatial transcriptomics (ref1, ref2, ref3)

⑤ subcellular ST: nearest-neighbor, InSTAnT, TopACT (cell type classification), APEX-seq-based atlas, Bento, TEMPOmap, subcellular mRNA kinetic modeling

⑵ Temporal Transcription Factor Analysis

Type 1. short time series data

○ STEM (short time-series expression miner)

○ TimesVector

Type 2. Long time series data: similar to analysis of general experimental groups

Type 3. Temporal sequencing

Record-seq

Live-seq

TMI

molecular recording

⑶ Spatiotemporal Omics

ORBIT (single-molecule DNA origami rotation measurement)

② 4D spatiotemporal MRI or hyperpolarized MR

in vivo 4D omics with transparent mice



16. Advanced 5. Database Utilization

⑴ Bioinformatics Resources

① Examples: PubMed, NCBI, bioRxiv, BioStars, Bioinformatics Stack Exchange, Stack Overflow

② Data Repositories: ArrayExpress, Gene Expression Omnibus (GEO), GenomeRNAi, dbGAP, The European Genome-phenome Archive (EGA), Database of Interacting Proteins (DIP), IntAct, Japanese Genotype-phenotype Archive (JGA), NCBI PubChem BioAssay, Genomic Expression Archive (GEA), GWAS Catalog

Webpage Crawling

Small Molecule Database

Integrated Small Molecule Database: Database providing data on the physiological activity of about 800,000 small molecules in vector format

AlphaFold2 Database: Database with structural data of 200 million proteins

ensembl: Transcriptome database

uniprot: Protein database

The Human Protein Atlas: Public access resource aiming to map all human proteins in cells, tissues, and organs

SGC (Chemical Probes): Provides a unique probe collection along with related data, control compounds, and usage recommendations

⑶ Spatial Transcriptomics database

HCA

HuBMAP

SODB

STOmicsDB

SpatialDB

SOAR

HTAN

Allen Brain Map

Drug Genomics Database

① NCBI dbSNP

② gnomAD

③ pharmVar

④ PHARMGKB

⑤ NCBI PubChem

⑥ Broad Institute CMAP

⑦ CTD

⑧ Comptox

⑨ Drug Bank

⑩ Stitch (search tool for interactions of chemicals)

ToppFun

⑫ depmap

⑬ L1000CDS2

⑭ L1000FWD

⑮ GDSC (Genomic of Drug Sensitivity in Cancer)

⑯ CCLE

ClinicalTrials.gov: Provides information on clinical trial progress for each drug

⑸ Clinical and Non-clinical Databases

TCGA (The Cancer Genome Atlas)

○ Diverse human tumor profiling data based on DNA, RNA, protein expression, and epigenetic factors

How to obtain TCGA data

GWAS Catalog

○ Provides curated information and educational resources on genome-related information to identify causal variants and understand disease mechanisms for new therapies

MGI (Mouse Genome Informatics)

○ Database collecting mouse mutation, phenotype, and disease data. Each gene ontology (GO) is well organized

Open Targets Platform

○ Integrated platform for phenotype data such as expression, co-localization, and prioritization signature related to specific targets associated with certain diseases

⑤ UK Biobank

○ Metabolome: Sample of 120,000. Measured from blood collected between 2006 and 2010

○ Blood Biomarkers: Sample of 500,000. Measured from blood collected between 2006 and 2010

○ Genome (GWAS, WES, WGS): Sample of 500,000. Measured from blood collected between 2006 and 2010

○ Summary-level Clinical Records: Sample of 500,000. Hospital diagnoses (ICD codes) and date of first diagnosis

○ Record-level Clinical Records: Sample of 250,000. Date-specific diagnosis/prescription records. The start year for tracking is as follows

○ 1997 for England

○ 1998 for Wales

○ 1981 for Scotland

⑥ Other Clinical Databases


Country Institution Clinical Data Genomic Data Transcriptomic Data Proteomic Data Imaging Data
USA PRIMED O O      
USA All of US O O      
USA National Institute of Diabetes and Digestive and Kidney Diseases O O      
USA Slim Initiative in Genomic Medicine for the Americas O O      
UK UK Biobank O O O O O
China Kadoorie Biobank O O   O O

Table 7. Other Clinical Databases


⑹ Policy Databases

EQIPD Quality System: New non-clinical research quality system developed by the European Quality in Preclinical Data (EQIPD) consortium composed of 8 countries and 29 institutions, applicable to both public and private sectors

FAIRsharing: Curated information and educational resources on database and data policy-related data and metadata standards

⑺ Networking Platforms

Chemicalprobes.org: Portal site to receive expert advice on finding and using chemical probes in pharmaceutical research and drug development

European Lead Factory: Collaborative public-private partnership site for innovative drug development

Genotype-Tissue Expression project: Public resource project for tissue-specific gene expression and regulation

GOT-IT Expert Platform : Platform facilitating exchange between academic researchers and industry experts to promote new academia-industry collaborations

SPARK Global Initiative: International network focusing on exchanging specialized knowledge and addressing immediate unmet medical needs, enhancing and developing projects



Input: 2021.10.02 13:49

Modification: 2023.07.11 11:19

results matching ""

    No results matching ""