Transcriptomics Analysis Pipeline

Recommended post: 【Bioinformatics】 Bioinformatics Analysis Table of Contents

1. QC 1. Experimental QC

2. QC 2. Data QC

3. QC 3. Filtering

4. QC 4. Alignment

5. QC 5. Normalization

6. QC 6. Batch Effect

7. Common 1. Clustering

8. Common 2. Differentially Expressed Gene (DEG) Analysis

9. Common 3. Gene Set Enrichment

10. Common 4. Gene Interaction Analysis

11. Common 5. Cell Type Mapping Analysis

12. Advanced 1. Alternative Splicing Analysis (AS Analysis)

13. Advanced 2. Trajectory Analysis

14. Advanced 3. Epigenomics Analysis

15. Advanced 4. Special Transcriptpmics Analysis

16. Advanced 5. Database Utilization

a. Sequencing Technologies

b. Bioinformatics

c. Seurat Pipeline

d. Proteomics Pipeline

Figure 1. Analysis Guides

1. `QC 1.` Experimental QC (Sample Level Quality Control)

⑴ Definition: Metric for assessing tissue quality

⑵ RNA-seq methodology

① Step 1. Purify RNA: DNAse treatment to remove genomic content

② Step 2. Enrich for mRNA using polyA selection

③ Step 3. Fragment RNA to library insert size (200-400 nt)

④ Step 4. Convert RNA to cDNA

⑤ Step 5. Prepare cDNA for sequencing (ligate adapters, amplify, add barcodes)

⑥ Step 6. Sequence one or both ends of fragments (usually 50, 100, or 150 nt per read)

⑦ Step 7. Map to genome

⑶ Type 1. RIN (RNA Integrity Number)

① Background Knowledge: mRNA comprises less than 3% of total RNA. rRNA makes up more than 80% (mainly 28S [2 kb] and 18S [5 kb] in eukaryotic cells).

② Measured by the Agilent 2100 Bioanalyzer.

③ RIN Algorithm: It uses features such as the ratio of 18S to total RNA, the ratio of 28S to total RNA, and the 18S normalized height.

④ RIN = 10: Intact RNA.

⑤ RIN = 1: Completely degraded RNA.

⑥ RIN > 7: Generally considered suitable quality level for RNA-seq.

⑷ Type 2. DV200: Percentage of fragments around 200 nt in FFPE tissues, indicating RNA fragmentation

⑸ Type 3. Nucleic Acid Purity Quantification via Absorbance Ratio

① 260 nm / 280 nm Ratio

○ Pure DNA: ~1.8

○ Pure RNA: ~2.0

○ A low 260 nm / 280 nm ratio suggests the presence of proteins or phenol, which absorb at 280 nm.

② 260 nm / 230 nm Ratio

○ Pure DNA/RNA molecules: ~2-2.2

○ A low 260 nm / 230 nm ratio indicates the presence of other contaminants.

⑹ Type 4. Nucleic Acid Weight Quantification

① Extract RNA using miRNeasy Mini Kit (QIAGEN) or similar methods.

② RNA weight criterion: At least 250 ng.

⑺ Type 5. RNA Quality Score (RQS)

⑻ Type 6. ChIP-seq Experimental QC

① ChIP grade monoclonal antibody: Pre-tested with ChIP-seq. 20-30% of the commercially produced antibodies tested were unsatisfactory for ChIP-seq.

② qPCR: Best to test positive control region(s) using the ChIP sample. Hope to detect >10-12x fold enrichment over IgG (non-specific antibody).

③ Biotinylated transcription factor: Permits factor pull-down on streptavidin. Independent of antibodies.

④ CUT&RUN, CUT&Tag, and ChIP-exo are methods to improve the resolution and signal-to-noise ratio of peaks in ChIP-seq.

⑼ Type 7. ATAC-seq Experimental QC

① Tn5 Concentration: Higher Tn5 concentration relative to DNA concentration increases ATAC-seq signal intensity at promoters and enhancers while reducing fragment size.

② Sequencing Lane Cluster Density: Shifts fragment length distribution and TSS enrichment.

⑽ Type 8. Spatial Transcriptomics Experimental QC (ref)

Table 1. Spatial Transcriptomics Experimental QC

① Fresh-frozen tissue has higher RNA integrity than FFPE, but poorer tissue morphology quality.

② For FFPE samples, DV200 is a more appropriate metric.

③ Image-based spatial transcriptomics platforms like Xenium and CosMx can better tolerate RNA degradation compared to spot-based ST methods.

2. `QC 2.` Data QC (Sequence Level Quality Control)

⑴ Definition: Metric for assessing data quality

⑵ Type 1: QC metrics for the data itself: Used for external validity confirmation by manual inspection or comparing with other datasets

① QC metrics

○ dUTP method: “_1.fastq” represents first strand (anti-sense), “_2.fastq” represents second strand (sense; original RNA sequence)

○ base quality

○ mapping rate

○ mappability filter

○ Type 1. uniqueness: How unique each sequence is starting at a particualr base and of a particular length

○ Type 2. alignability: How uniquely k-mer sequences align to a region of the genome (up to 2 mismatches allowed)

○ mappability score : S = 1 / # of matches found in genome

○ Long reads can resolve mapping issue among highly similar regions: Some regions of the genome are troublesome regardless of read length.

○ Non-coding RNA ratio

○ High non-coding RNA ratio and high duplication indicate poor RNA quality

○ GC content

○ High GC content: Suggests potential rRNA contamination; requires filtering of 5S, 18S, 28S rRNA

○ Low GC content: Indicates potential issues with reverse transcription

○ Related to CpG island.

○ Read duplicate

○ PCR duplicate: A duplicate that is merely a replication of the same nucleic acid molecule during PCR.

○ If the sequences are exactly the same, they are considered PCR duplicates. If they are similar, they are considered biological duplicates.

○ In paired-end experiments, duplicates occur at the paired-end level.

○ Generally, DNA-seq involves removing duplicates, while RNA-seq does not: In RNA-seq, the same sequence may repeatedly appear not only due to technical duplicates but also due to high expressing transcripts or short genes. Removing these biological duplicates can reduce the dynamic range of the data or decrease statistical power.

○ The likelihood of increased duplication rate may rise with the number of cycles during the PCR process: By checking the correlation between the duplicate rate and the number of PCR cycles, one can distinguish between technical and biological duplicates.

○ Tools for removing duplicates: Samtools, Picard, Trimmomatic, Trim Galore!, fastp

○ Unique molecule(UMI)

○ RNA quality is low if unique molecule count is below 10%

○ Sequencing depth

○ # of reads

○ In cases of alternative splicing or allele-specific expression: >50 million reads are recommended

○ DEG analysis of a ribo-depleted library: Approximately 50-60 million total reads are recommended

○ A ribo-depleted library is recommended to have twice the sequencing depth compared to a poly-A selected library because the ribo-depleted library can capture a wider variety of RNA (e.g., tRNA, rRNA, immature RNA) that the poly-A selected library cannot capture.

○ Sequencing read length

○ Small fragments: Adaptor binding starts, requiring adaptor trimming.

Figure 2. The reason why adaptors are read when fragments are short

○ Advantages of long-read sequencing compared to short-read sequencing: lower cost per nucleotide, more accurate mapping, ability to identify splice junctions, capability to detect allele-specific expression, ability to resolve repetitive sequences.

○ Disadvantages of long-read sequencing compared to short-read sequencing: higher overall cost, higher cost per read, requires more adaptor trimming. Long-read sequencing (PacBio, Oxford Nanopore) is more likely to contain adapter sequences due to its sequencing method, which involves repetitive reading and single-molecule sequencing.

○ Adaptor sequence

○ Issue 1: Since adapter sequences are artificial, they can cause alignment and variant calling to fail or introduce biases.

○ Issue 2: Adapter sequences contain identical sequences, leading to biases in coverage analysis and differential gene expression (DEG) analysis.

○ To remove adapter sequences, tools such as AdaptorRemoval, Cutadapt, Trimmomatic, and bbduk.sh are used.

○ Exonic portion

○ In the case of poly-A(+) RNA-seq, exonic reads account for 50 to 70%.

○ In the case of rRNA(-) RNA-seq, the proportion of exonic reads decreases.

○ Paired-end vs Single-end

○ Single end reads: Each library fragment is sequenced only from a single end.

○ Paired end reads: Each library fragment is sequenced from both ends.

○ Paired-end (PE) reads are more accurate than single-end (SE) reads but are approximately twice as expensive.

○ If the goal is simply to calculate gene counts for DEG analysis, SE is sufficient.

○ SE is recommended when RNA is significantly degraded.

○ It’s better to avoid using PE on short fragments to prevent inefficiencies caused by sequencing the same nucleotide.

Figure 3. Inefficiencies that can occur in paired-end (PE) sequencing

○ Mate pair reads

○ A longer segment of DNA is circularized and reads from the joint region are sequenced (both ends).

○ Strand-specific (ssRNA-seq)

○ Can be either forward or reverse.

○ Poly-A selection vs Ribo-depletion

○ Advantages of a ribo-depletion library

○ Works even with RNA degradation: cDNA fragments are uneven and short. Poly-A selection is highly biased toward the 3’ end, making it less accurate.

○ Suitable for studying non-coding RNA.

○ Disadvantages of a ribo-depletion library

○ Expensive.

○ Includes a large number of meaningless reads.

② QC metrics for ChIP-seq

○ Mapping ratio

○ Read depth: ENCODE recommends ≥10 million uniquely mapped reads for TFs (histone modifications).

○ Library complexity

○ Background uniformity (biasedness)

○ GC summit bias

○ qPCR enrichment

○ Fragment size distribution

○ Input DNA qualuty via NanoDrop

○ Cross-correlation analysis: NSC (normalized strand coefficient), RSC (relative strand correlation)

○ FRiP (fraction of reads in peaks) (ref1, ref2), RUP(reads under peaks): Proportion of reads in a ChIP-seq dataset that falls into a peak. ENCODE recommends FRiP (RUP) ≥ 1%.

○ SPOT(signal portion in tags): Indicates good signal-to-noise ratio.

○ IDR (irreproducibility discovery rate) (ref1, ref2)

○ denQCi, simQCi, QC-STAMP (ref)

○ Motif analysis: What % of peaks contain the TF motif, and does the motif tend to occur in the middle of the peak? Not expected for all peaks, because a TF may bind as part of a protein complex or a heterodimer.

○ Fragment length distribution: This refers to the distribution of fragment lengths, such as single nucleosome, dimer, trimer, etc. It is relevant for techniques like CUT&Tag and ATAC-seq.

○ IP enrichment strength: A graph representing the ratio of reads to bin proportions. Tools like deepTools or CHANCE can be used for analysis. This takes advantage of the fact that the peak ratio appears narrowly distributed across the entire window.

○ Strand lag: The actual binding site signal exhibits a lag between signals on the + strand and - strand. Tools like ZINBA and PePR remove artifact signals that do not exhibit strand lag.

③ QC metrics for ATAC-seq

○ FastQC: For example, “Per base sequence content” can be used to assess the integration bias of Tn5 transposase.

○ The ataqv package provides 35 QC metrics as follows (ref): fragment length distribution, % reads that are high-quality and autosomal, % reads properly paired end mapped, % reads that aligned to autosomes that were duplicates, short-to-mononucleosomal-ratio, TSS enrichment, duplicate fraction in peaks, duplicate fraction outside of peaks, peak duplicate ratio, cumulative fraction of high-quality autosomal reads in peaks, cumulative fraction of the genome that falls within peaks, distribution of mapping qualities, number of total reads, % alignments marked secondary, % alignments marked supplementary, % alignments marked as duplicates, mean mapping quality, median mapping quality, % reads unmapped, % reads with an unmapped mate, % QC-fail reads,% unpaired reads, % reads with mapping quality 0, % reads that paired and mapped but in RF orientation, % reads that paired and mapped but in FF orientation, % reads that paired and mapped but in RR orientation, % reads that paired and mapped but on separate chromosomes, % reads that paired and mapped but too far from mate, % reads that paired and mapped but not properly, % reads that aligned to autosomes, % reads that aligned to mitochondria, % reads that aligned to mitochondria that were duplicate, number of peaks called, fragment length distribution distance, max fraction of reads from a single autosome

④ Method 1. Other datasets: 10x Genomics, GEO, ZENODO, etc.

⑤ Method 2. FastQC: Requires Fastq files as input.

○ 2-1. FastQC and multiQC: Most popular

○ Base pair quality of reads

○ Adaptor sequences in reads

○ PCR duplicates

○ Overrepresented sequences

○ GC distribution for each sample

○ 2-2. QoRTs (ref1, ref2): Very good

○ RNA degradation: Distribution of reads 5’ → 3’

○ Strandedness check

○ GC bias

○ 2-3. RNASeQC: Decent

○ 2-4. RSeQC: Used to have major bugs

○ 2-5. Use conda Fastqc command (Linux)

○ 2-6. Download SRA (Sequence Reads Archive) Toolkit and use fastqc command

○ Below is an example of generated files.

sudo apt install fastqc
cd sratoolkit.3.0.5-ubuntu64/
cd bin
fastqc DRR016938.fastq

○ Example file

⑥ Method 3. Trimmomatic: Takes Fastq files as input.

⑦ Method 4. FASTX-Toolkit: Takes Fastq files as input.

⑧ Method 5. QC after mapping: Takes SAM or BAM files as input.

○ QC metric

○ % uniquely mapped reads

○ % reads mapping to exons

○ Complexity, i.e. x% of read counts being taken up by y% of genes

○ Consistency across samples

○ Sample swap: Match Y chromosome, Xist, genotype (e.g., SNP) with metada.

○ 5-1. Qplot

○ 5-2. Samtools:

⑨ Method 6. SnakeMake: Integrated pipeline with QC functionality

○ Snakefile: A Snakemake script based on Python. The filename itself is Snakefile.

# Snakefile

# Define the result file
rule all:
    input:
        "results/processed_data.tsv"

# Data processing rule
rule process_data:
    input:
        "data/raw_data.tsv"
    output:
        "results/processed_data.tsv"
    shell:
        """
        cat {input} | awk -F'\t' '' > {output}
        """

○ config.yaml (optional): Configuration for the Snakemake workflow (ref)

○ requirements.txt (optional): Package dependencies

○ Input file

○ Output file

⑩ Method 7. QuASAR-QC: Applicable for Hi-C data.

⑪ Trouble-shooting

⑶ Type 2: Comparison or reproducibility metrics between samples: Used for internal validity assessment

① Purpose 1: Evaluate sample quality by examining ranks of two correlated variables within a single sample

○ Example: Investigate alignment of expression levels of two known similar genes

○ Example: Examine if two genes known to be similar appear in the same cluster

② Purpose 2: Often used to examine correspondence between pairs of samples

③ Purpose 3: Investigate correlation between two variables with different data distribution characteristics

○ Concept somewhat distinct from QC analysis

○ Example: Correlation coefficient between gene A expression in scRNA-seq and A expression in ST (Spatial Transcriptomics)

④ Method 1. Pearson correlation coefficient

○ Definition: For standard deviations σx and σy of X and Y,

○ Features

○ Correlation between two variables measured on interval or ratio scales

○ Focused on continuous variables

○ Assumption of normality

○ Widely used in most cases

○ Calculation in RStudio

○ cor(x, y)

○ cor(x, y, method = "pearson")

○ cor.test(x, y)

○ cor.test(x, y, method = "pearson")

⑤ Method 2. Spearman correlation coefficient

○ Definition: Defined for x’ = rank(x) and y’ = rank(x)

○ Features

○ A method for measuring the correlation between two variables on an ordinal scale

○ A non-parametric approach for ordinal variables

○ Advantageous for data with many zeros

○ Sensitive to deviations or errors in the data

○ Yields higher values compared to Kendall’s correlation coefficient

○ Useful formula

○ Calculation in RStudio

○ cor(x, y, method = "spearman")

○ cor.test(x, y, method = "spearman")

⑥ Method 3. Kendall correlation coefficient

○ Definition: Defined using concordant and discordant pairs

○ Features

○ A method for measuring the correlation between two variables on an ordinal scale

○ A non-parametric approach for ordinal variables

○ Suitable for data with many zeros

○ Useful for small sample sizes or when there are many tied ranks in the data

○ Procedure

○ Step 1: Sort the y values in ascending order based on the x values. Represent each y value as y_i.

○ Step 2: For each y_i, count the number of concordant pairs where y_j > y_i (where j > i).

○ Step 3: For each y_i, count the number of discordant pairs where y_j < y_i (where j > i).

○ Step 4: Define the correlation coefficient.

○ n_c: total number of concordant pairs

○ n_d: total number of discordant pairs

○ n: size of x and y

○ Calculation in RStudio

○ cor(x, y, method = "kendall")

○ cor.test(x, y, method = "kendall")

⑦ Method 4. Q-Q plot between empirical CDFs (Cumulative Distribution Functions)

⑧ Method 5. Q-Q plot between ordered p-values

⑨ For Hi-C sequencing, available methods include HiCRep, GenomeDISCO, HiC-Spector, and QuASAR-Rep.

3. `QC 3.` Filtering

⑴ Type 1: Filtering during data processing

① Removal of inappropriate reads from sequencing data to generate trimmed data

② Example 1: Adapter Sequence

○ Issue 1: Since adapter sequences are artificial, they can cause alignment and variant calling failures or introduce biases.

○ Issue 2: Adapter sequences contain identical sequences, leading to biases in coverage analysis and differential gene expression (DEG) analysis.

○ If the fragment is too short, the adapter sequence starts to be read. → In this case, adapter trimming is performed.

○ Long-read sequencing requires more adapter trimming. → Long-read sequencing (PacBio, Oxford Nanopore) is more likely to contain adapter sequences due to its sequencing method, which involves repetitive reading and single-molecule sequencing.

○ To remove adapter sequences, tools such as AdaptorRemoval, Cutadapt, Trimmomatic, and bbduk.sh are used.

○ Adaptor trimming method: Align each paired-end read and remove the differing parts.

Figure 4. Adaptor trimming method

⑵ Type 2: Filtering in downstream analysis

① 2-1. Gene filtering: Excluding genes with low expression to avoid capturing non-informative or unrealistically DEG genes

② 2-2. Barcode filtering

○ Case 1: Low RNA expression (e.g., poor QC results)

○ Case 2: High RNA expression (e.g., contamination due to RNA diffusion)

○ Case 3. Doublet removal: scdDblFinder (R-based), hybrid score of scds (R-based), doubletCells of scran (R-based), DoubletFinder

③ Can be implemented using the subset function in the R Seurat pipeline, matrix operations in Python scanpy, etc.

4. `QC 4.` Alignment

⑴ Definition: Process of finding the location of a specific sequencing read within a genome or transcriptome

⑵ Step 1: Raw data preparation: File extension is .Fastq

① Type 1: Single-end sequencing (SES): Sequencing using one side of the adapter

② Type 2: Paired-end sequencing (PES): Sequencing using both sides of the adapter

○ dUTP method standard: “_1.fastq” represents first strand (anti-sense), “_2.fastq” represents second strand (sense; original RNA sequence)

○ First, perform sequencing through one adapter (obtain Read1), and then perform sequencing through the opposite adapter (obtain Read2).

○ Read1 and Read2 from the same DNA fragment can be easily matched because they come from the same cluster.

○ Advantages: Higher accuracy (due to comparison between Read1 and Read2), easy extraction of DNA mutations, easy analysis of repetitive sequences, easy mapping between different species.

○ Disadvantages: Higher cost, requires more steps than SES.

③ Long-read sequencing

	Short-Read Seq	Long-Read Seq
Release Year	Early 2000s	Mid 2010s
Average Read Length	150-300 bp	5,000-10,000 bp
Accuracy	99.9%	95-99%

Table 2. The comparison of short-read seq and long-read seq

Figure 5. The difference between short-read seq and long-read seq

○ There are no sequencing gaps, allowing for the following analyses:

○ Advantage 1: Alternative Splicing Analysis: Identification of alternative splicing events and isoforms is possible.

○ Advantage 2: Copy Number Variation (CNV) Analysis: For example, the number of repeat sequences is critical in Huntington’s Disease.

○ Advantage 3: Facilitates the integration of epigenetics and transcriptomics.

○ Type 1: Pacific Biosciences SMRT (Single Molecule Real-Time) sequencing: The average read length is ~20 kb.

○ Type 2: Oxford Nanopore Sequencing: The average read length is ~100 kb.

⑶ Step 2. Preprocessing: The process of preprocessing raw sequencing data to extract valuable summarized information.

① Process 1. alignment

Figure 6. Alignment process

○ Overview

○ Definition: The process of determining how two or more sequences align with each other. In other words, pattern matching.

○ Motif: A specific pattern or subsequence that appears repeatedly in a given sequence. Motifs are often associated with biological functions.

○ Purpose 1. To identify the similarity between two sequences for use in processes such as mapping and assembly.

○ Purpose 2. Variation Identification: Detecting variations such as insertions, deletions, and substitutions.

○ Pattern matching: The concepts of PFM and PWM

○ Step 1. Construction of the PFM (Position Frequency Matrix)

○ x_ij: The count of nucleotide i occurring at the j-th position.

○ Step 2. Construction of the PWM (Position Weight Matrix)

○ PWM is the relative entropy or the Kullback-Leibler divergence of the PFM.

○ p_i: Pseudocount or Laplace estimator (e.g., 0.25).

○ q_i: Expected or background probability of observing nucleotide ( i ) (a priori) (e.g., 0.25).

○ p_i and q_i are determined based on information theory.

○ Step 3. Assign scores to each k-mer using the PWM.

○ Pattern matching: Gibbs sampling

○ Step 1: Initialization

○ 1-1. Randomly select k-mers, i.e., motifs, from each of the N sequences.

○ 1-2. Calculate the frequency of A, C, G, T nucleotides at each position.

○ 1-3. Consider all remaining sequences that are not selected as motifs to be the background.

○ 1-4. Construct the initial PWM.

Figure 7. Gibbs Sampling initialization

○ Step 2: Iteration

○ 2-1. Randomly select one of the N sequences.

○ 2-2. Construct a PWM using all sequences except the one selected.

○ 2-3. Calculate the score distribution by considering all possible motifs in the selected sequence.

○ 2-4. Probabilistically determine a new motif based on the score distribution.

○ 2-5. Repeat steps 2-1 to 2-4 until reaching the maximum number of iterations or until the information content does not change significantly.

Figure 8. Gibbs Sampling iteration

○ Pattern matching: Sequence Logo

○ A graphical representation of an amino acid or nucleic acid multiple sequence alignment.

○ Developed by Tom Schneider and Mike Stephens.

○ The y-axis represents information content, as defined in information theory.

○ Example 1: When all nucleotide sequences (A, T, G, C) have the same frequency: Maximum entropy = 2. Actual entropy = 2. Information content = 0

○ Example 2: When only one nucleotide is present: Maximum entropy = 2. Actual entropy = 0. Information content = 2

○ Example 3: When two nucleotides have the same frequency: Maximum entropy = 2. Actual entropy = 1. Information content = 1

○ Pattern matching, Compression: BWT(Burrows-Wheeler transform)

○ History: Centered around David Wheeler

○ First to earn a Ph.D. in Computer Science: 1951, University of Cambridge.

○ First to invent a programming language.

○ Invented the concept of functions in computer programming together with Maurice Wilkes and Stanley Gill.

○ Discovered BWT while working as a consultant at Bell Labs in the early 1980s.

○ Published BWT with Michael Burrows ten years later.

○ Step 1. When the input string is mississippi, append $ at the end to generate the matrix of all cyclic rotations.

○ Step 2. Consider $ as the first character and sort the matrix in alphabetical order to obtain the BWT matrix.

○ In the figure below, i is called the SA index, and due to Feature 3, it must be stored together.

Figure 9. How to calculate BWT matrix

○ Step 3. Apply run-length encoding to the last column of the BWT matrix to obtain the BWT transformation.

○ Last column of the BWT matrix: ipssm$pissii

○ Run-length encoding: ipssm$piissi or ip2sm$pi2s2i

○ Feature 1. It tends to organize a string that contains repeats into runs of consecutive characters. An efficient compression technique.

○ When constructing the BWT matrix, sorting in alphabetical order allows repetitive sequences to be easily grouped by the same alphabet.

○ Using this method, even the human genome, which contains many repetitive sequences, can be compressed to approximately 750 MB.

○ Feature 2. The BWT is invertible i.e. given the BWT of a string you can recover the original string without any additional information.

○ If the BWT transformation is ippssm$piissi, the first column of the BWT matrix becomes $iiiimppssss, which is the alphabetically sorted sequence of the BWT transformation.

○ The end of the input string is naturally $, and by looking at the first row of the BWT matrix, we can determine that the input string ends with i$.

○ In the BWT matrix, the row starting with i$ is easily identified as the second row, which is the first row that starts with i.

○ Following this method, we can identify the LF (last-first) property, which allows us to reconstruct the original input string.

○ LF property: For a given character, the k-th occurrence in the first column of the BWT matrix corresponds to the k-th occurrence in the last column.

○ Feature 3. Fast pattern matching with BWT: Assume the pattern is sis.

○ If the BWT transformation is ippssm$piissi, the first column of the BWT matrix becomes $iiiimppssss, which is the alphabetically sorted sequence of the BWT transformation.

○ To search for sis (first two letters si), find rows in the BWT matrix where the last column is s and the first column is i: SA index = 8, 5.

○ To search for sis (last two letters is), find rows in the BWT matrix where the last column is i and the first column is s: SA index = 6, 3.

○ Method 1. According to the LF property, for i to be common, the corresponding SA index values must be 5 and 6.

○ Method 2. The SA index values must differ by exactly 1, meaning they should be 5 and 6.

○ Method 3. Execute the following algorithm concerning the number of letters that appear alphabetically before it, count, and the frequency of that letter, occur. If multiple rows are output, it means multiple matches were found.

def BWT(string):
    string += '\n'    
    t = [string[i:] + string[:i] for i in range(len(string))]
    t.sort()    
    bwt_string = ''.join([l[-1] for l in t])
    return bwt_string

def suffix_array(string):
    string += '\n'
    sa = [index for suffix, index in sorted((string[i:], i) for i in range(len(string)))]    
    return sa

def cal_count(string):
    mydict = {}
    for char in string:
        mydict[char] = 0
    for char in string:
        for key in mydict.keys():
            if key > char:            
                mydict[key] += 1
    return mydict

def cal_occur(bwt_string):
    mydict = {}
    flags = {}
    for char in bwt_string:
        mydict[char] = []
        flags[char] = 0
    for char in bwt_string:
        for key in mydict.keys():
            if key == char:
                flags[key] += 1
                mydict[key].append(flags[key])
            else:
                mydict[key].append(flags[key])
    return mydict

def update_range(lower, upper, count, occur, a):
    if lower == 0:
        lower_new = count[a] + 0 + 1
    else:
        lower_new = count[a] + occur[a][lower-1] + 1
    upper_new = count[a] + occur[a][upper]
    return lower_new, upper_new

def find_match(query, reference):
  def reverse(s):
    return s[::-1]

  count = cal_count(reference)
  occur = cal_occur(BWT(reference))

  lower = 0
  upper = len(BWT(reference))-1
  for char in reverse(query):
     if lower > upper: 
       break #no matching
     lower, upper = update_range(lower, upper, count, occur, char)
  sa = suffix_array(reference)
  matched_positions = sa[lower:upper+1]
  
  return sorted(matched_positions)

'''
reference = mississippi
bwt_reference = 'ipssm$pissii'
query = 'sis'
count: {'i': 0, 'm': 3 , 'p': 4, 's': 6 }
occur: {'$': [0,0,0,0,0,1,1,1,1,1,1,1],
        'i': [1,1,1,1,1,1,1,2,2,2,3,4],
        'm': [0,0,0,0,1,1,1,1,1,1,1,1],
        'p': [0,1,1,1,1,1,2,2,2,2,2,2],
        's': [0,0,1,2,2,2,2,2,3,4,4,4]}
find_match('sis', 'mississippi')
# [3]
'''

○ Since sis corresponds to SA index 5, the matching pattern in mississippi is found from “(5-1) to (5-1) + (length-1)”.

○ Type 1: BLAST, BLAT: Released in 1997 and 2002, respectively. Based on k-mer hashing.

○ Type 2. Alignment for bulk RNA-seq

○ 2-1. STAR (Spliced Transcripts Alignment to a Reference): Most commonly used alignment tool. (ref). Uses suffix arrays. Fast but can be memory intensive.

○ --genomeDir: Directory containing the genome index files. If there are no files, run --runMode genomeGenerate first.

○ --outFilterMismatchNmax: Maximum number of mismatches per pair.

○ --outFilterMultimapNmax: Number of allowed multiple alignments per read.

○ --runThreadN: Number of threads.

### STEP 1. Installation
# Method 1 - reference : https://github.com/alexdobin/STAR 
wget https://github.com/alexdobin/STAR/archive/2.7.10b.tar.gz
tar -xzf 2.7.10b.tar.gz
cd STAR-2.7.10b
cd source
make STAR

# Method 2 
sudo apt install rna-star


### STEP 2. STAR genomeGenerate
STAR --runMode genomeGenerate \
     --runThreadN NumberOfThreads \
     --genomeDir /path/to/your/GenomeDir \
     --genomeFastaFiles /path/to/combined_genome.fa \
     --sjdbGTFfile /path/to/combined_genome.gtf 


### STEP 3. STAR run
STAR --genomeDir /path/to/your/GenomeDir \
     --readFilesIn /path/to/your/read1.fastq /path/to/your/read2.fastq \
		 --readFilesCommand zcat \
     --runThreadN NumberOfThreads \
     --outFileNamePrefix /path/to/your/outputPrefix \
     --outSAMtype BAM SortedByCoordinate

# Example     
STAR --genomeDir ./reference --readFilesIn read1.fastq read2.fastq --runThreadN 4 --outFileNamePrefix ./SaveDir --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate

○ 2-2. HISAT, HISAT2: Designed and implemented the first GFM (graph FM index) based on the BWT extension of a graph. It succeeds TopHat and performs genetic variation-aware alignment.

○ HISAT-3N: Used for sequencing techniques that deal with nucleotide conversion, e.g. C to T in bisulfite-seq.

○ 2-3. Bowtie, Bowtie2: Uses Burrows Wheeler Transform (BWT) that utilizes efficient compression technique.

## STEP 1. install

# Debian (e.g., Ubuntu)
sudo apt-get update
sudo apt-get install bowtie2

# Red Hat (e.g., Fedora, CentOS)
sudo yum install bowtie2

# Conda
conda install -c bioconda bowtie2

# macOS 
brew install bowtie2


## STEP 2. Genome Generation
bowtie2-build hg38.fasta human_reference

○ 2-4. Tophat, Tophat2: Uses Burrows Wheeler Transform (BWT).

○ 2-5. BWA (Burrows-Wheeler Aligner): Uses Burrows Wheeler Transform (BWT).

○ 2-6. MUMmer

○ Performs rapid alignments of DNA and protein.

○ Whole genome alignments of prokaryote, fungal, and mammalian genomes.

○ 30 Mb fungal genome takes 1~2 min forward and reverse.

○ Can align based on 6-frame translation.

○ Use suffix tree to perform the alignments.

○ 2-7. CLC

○ 2-8. ContextMap2

○ 2-9. CRAC

○ 2-10. GSNAP

○ 2-11. MapSplice2

○ 2-12. Novoalign

○ 2-13. OLego

○ 2-14. RUM

○ 2-15. SOAPsplice

○ 2-16. Subread

○ 2-17. SOAP, SOAP2: Uses Burrows Wheeler Transform (BWT).

○ 2-18. mr/mrsFast

○ 2-19. Eland

○ 2-20. Bfast

○ 2-21. BarraCUDA

○ 2-22. CASHx

○ 2-23. Mosiak

○ 2-24. Stampy

○ 2-25. SHRiMP

○ 2-26. SeqMap

○ 2-27. SLIDER

○ 2-28. RMAP

○ 2-29. SSAHA

○ 2-30. bamnostic: Operating-system-agnostic. Python-based.

○ Type 3. Pseudo-alignment for bulk RNA-seq

○ Pseudo-alignment simplifies some steps of alignment to handle the time-consuming process more quickly.

○ Pseudo-alignment is less accurate than alignment but faster.

○ 3-1. Kallisto: paper, manual

○ 3-2. Sleuth: blogpost, tutorial

○ 3-3. Salmon: preprint, manual. Transcript level

○ Type 4. Alignment and counting for scRNA-seq

○ 4-1. CellRanger: Uses STAR alignment.

○ Type 5. Alignment and counting for ST data

○ 5-1. SpaceRanger: Limited to Visium. Uses STAR alignment but considers spatial barcoding.

○ Type 6. Long-read seq alignment

○ Mandalorion (Byrne et al., 2017)

○ Minimap, Minimap2 (Heng Li, 2018)

○ SQANTI (Tardaguila et al., 2018)

○ Pinfish (ONT, 2018)

○ StringTie2 (Kovaka et al., 2019)

○ TALON (Wyman et al., 2019)

○ FLAIR (Tang et al., 2020)

○ Bambu (GoekeLab GitHub, 2020)

○ FLAMES (Tian et al., 2020)

○ TALC (Broseus et al., 2020)

○ LongTron (Wilks et al., 2020)

○ uLTRA (Sahlin and Mäkinen, 2020)

○ LIQA (Hu et al., 2021)

○ 2passtools (Parker et al., 2021)

○ MuSTA (Namba et al., 2021)

○ IsoQuant (Prjibelski et al., 2022)

○ ESPRESSO (Gao et al., 2023)

○ Bloom2 (Nip et al., 2023)

○ TASSEL (Kainth et al., 2023)

○ MisER (Liu et al., 2023)

○ Type 7. splice-aware aligner: Aligner that considers RNA-specific events like splicing during alignment.

○ 7-1. bustools

② Process 2. Mapping

○ Definition: Storing sequences in a hash table and using this table to map similar sections between different sequences.

○ Background: All-vs-all sequence comparison consumes astronomical amounts of time, thus it’s necessary to find short, representative hash values for each sequence.

○ General Process:

○ Step 1: Divide the given sequence into several sliding windows.

○ Step 2: Search for all contiguous sequences of k elements within each sliding window.

○ Step 3: Among the k contiguous sequences, the lexicographically smallest sequence is the minimizer for that window.

○ Step 4: Apply a quaternary number system to the minimizer (i.e., A = 0, C = 1, T = 2, G = 3) to translate the minimizer into a specific value.

○ Step 5: Apply a hash function to the minimizer to generate a hash value, aiming to alleviate base distribution asymmetry and increase database efficiency.

○ Step 6: Store each minimizer from every window in the corresponding element of the hash table based on its hash value.

○ Step 7: Generate a set of minimizers from the query sequence (the sequence of interest).

○ Step 8: Generate a set of minimizers from the target sequence (the reference sequence).

○ Step 9: Search for minimizer hits: compare the minimizer sets of the query and target sequences to explore similar sections between the two sequences.

○ Note on Hash Function: The hash function does not necessarily need to be reversible, but hash collision pairs should not be easily discovered.

○ Advantages of the Mapping Process: Speeds up comparisons between different sequences and ensures the pipeline operates robustly even in the presence of noise or many SNPs.

○ Type 1: BLASR

○ Type 2. DALIGNER

○ Type 3: MHAP

○ Type 4: GraphMap

○ Type 5: minimap, minimap2 (mapping algorithms designed for long-read sequencing)

③ Process 3: Error-Correction

○ Definition: Correcting errors from the sequencing process. 0.2% - 0.5% for Illumina sequencing or higher in long read sequencing.

○ Requirement: A specific k-mer in the given sequence has a low frequency, a substitute k-mer with a close Hamming distance exists, and replacing it with the substitute k-mer does not generate any newly defined k-mers regardless of how the segment is defined. In this case, the substitution is applied.

○ Often included within the assembly process.

○ Type 1: pbdagcon

○ Type 2: falcon_sense

○ Type 3: nanocorrect

④ Process 4. assembly

Figure 10. Assembly process

○ Definition: ○ Definition: The process of reconstructing short DNA/RNA fragments, which are partially obtained due to technical limitations, into a graph aligned to the reference genome.

○ Class 1. Ab initio Assembly

○ Requires a reference genome.

○ Requires minimal computing resources.

○ The completeness of the genome and the performance of the aligner affect the accuracy of the assembly.

○ Two-pass alignment: A method for effectively detecting novel splice junctions in ab initio assembly.

○ Step 1. First pass: Align RNA-seq reads to the reference genome.

○ Step 2. Update splice junction information: Collect novel splice junctions detected in the first pass and update the reference.

○ Step 3. Second pass: Realign reads using the updated splice junction information.

○ STAR two-pass alignment

○ Single sample: Use --twopassMode Basic.

○ Multiple samples: Run STAR normally → Collect SJ.out.tab files for each sample → Re-run STAR with --sjdbFileChrStartEnd sj1.tab sj2.tab ....

○ Additional parameter 1. alignIntronMin: Minimum intron size

○ Additional parameter 2. alignSJoverhangMin: Minimum number of nucleotides required to recognize novel splice junctions (i.e., the minimum number of nucleotides that must be aligned between two exons for accurate splice junction mapping)

○ Additional parameter 3. alignSJDBoverhangMin: Minimum number of nucleotides required to recognize known splice junctions (i.e., the minimum number of nucleotides that must be aligned between two exons for accurate splice junction mapping)

○ Class 2. De novo Assembly

○ Does not require a reference genome, eliminating reference bias.

○ Constructs contigs by connecting overlapping reads in a graph structure; contigs are later compared with the reference for naming in the post-assembly stage.

○ Requires substantial computing resources: ~1G RAM for ~1M 76bp reads.

○ Can identify chromosomal aberrations or novel isoforms that are not present in the reference genome.

○ Step 1. Contig: Contiguous sequence constructed by joining reads that have overlapping sequence information.

Figure 11. Contig

○ E has identity identity at only one end, so only leads to inclusion of 3 nodes.

○ Path ABCD includes 4 nodes and is chosen as better path.

○ Step 2. Create an assembly graph

○ 2-1. de Bruijn graph

Figure 12. de Brujin graph

○ Using k-mers: The graph is constructed such that each node overlaps with the previous node by k-1 bases.

○ Problems arise when repetitive sequences are longer than the read length: There are multiple possible ways to connect the left and right sides of the repetitive sequences.

○ In cases with repetitive sequences, large library insert size and mate-pair sequencing can be helpful.

○ In a de Bruijn graph, both strands need to be considered: Therefore, reverse complements must also be taken into account.

○ Advantage 1: Increased efficiency for computing overlaps between sequences

○ Advantage 2: Straightforward collapsing of assembly paths around low k-mer count repeats

○ Advantage 3: Trimming of low k-mer count sequences that terminate

○ 2-2. Needleman-Wunsch graph: Pairwise alignment

○ Stage 1. Construct global alignment graph

Figure 13. Construction of Needleman-Wunsch graph

○ Stage 2. Optimal Graph Path Search: Dynamic Programming. Traceback along high values at the final node (i.e., T-T).

Figure 14. Optimal Graph Path Search in a Needleman-Wunsch Graph

○ Pseudocode

○ 2-3. Smith-Waterman graph: Pairwise alignment

○ Stage 1. Construct a local alignment graph: Assign zero to negative values in the global alignment graph.

○ Stage 2. Optimal graph path search: Dynamic programming. Traceback from the node with the maximum value.

Figure 15. Smith-Waterman graph

○ Pseudocode

○ Example: If a gap occurs, it might be necessary to reconsider it. It feels like the sequence assignment order should be slightly delayed.

Figure 16. Example of Smith-Waterman graph

○ 2-4. Progressive alignment: MSA(multiple sequence alignment)

○ Uses a tree-based strategy where similar sequences are aligned initially, and then additional sequences are aligned to those alignments.

○ The order you align sequences in a progressive multiple sequence alignment is important.

○ 2-5. Iterative alignment: MSA(multiple sequence alignment)

○ Similar to progressive, however further realigns initial sequences to current full MSA.

○ Reduces dependence on initial alignments.

○ 2-6. Hidden Markov model: MSA(multiple sequence alignment)

○ Can leverage profile HMMs to train and iteratively align sequences.

○ 2-7. UPGMA(unweighted pair group method with arithmetic mean): MSA(multiple sequence alignment)

○ 1^st. Calculate pairwise edit distances between all possible pairs of sequences by aligning them together (Smith-Waterman / Needleman-Wunsch)

○ Edit distance: Hamming distance, Levenshtein distance

○ 2^nd. Create a hierarchical order using one of agglomerative hierarchical clusterings, UPGMA.

○ 3^rd. Progressive align sequences (and alignments) together by traversing up the tree.

○ Step 3. Graph traverse: Search algorithm

○ Eulerian path: A path in a graph that visits each edge exactly once. If there are repeating sequences, two nodes can be connected by more than one edge. Traversing the entire graph is not needed. Both Breadth-First Search (BFS) and Depth-First Search (DFS) are possible.

○ Euler-SR

○ Velvet: Fast for small to medium genomes, ~50 Mb or less. The “Tour Bus” algorithm, which utilizes Dijkstra algorithm, in Velvet removes ‘bubbles’ in graph using coverage, sequence identity, and length thresholds.

○ Abyss: Fast for small to medium genomes, ~50 Mb or less.

○ SSAKE (Warren et al. 2007): short-read assembler

○ VCAKE (Jeck et al. 2007): short-read assembler

○ miniasm: Long-read seq assembler. Uses the distances in nucleotide sequences from multiple mappings to form corresponding edges.

○ In this case, it can be said that the ABCD of the query sequence corresponds to the ABCD of target sequence 3.

○ Step 4. Clean the assembly graph: If the corresponding sections are P → Q → R, it can be simplified to P → R.

○ Step 5. Generate unitigs.

○ During this process, information such as position and mismatch is generated.

○ The result of the assembly process generates a .bam file.

○ Class 3. Combined strategy

○ If reference genome is of poor quality or from distantly related species.

○ Type 1. wgs-assembler

○ Type 2. Falcon

○ Type 3. ra-integrate

○ Type 4. miniasm (an assembly algorithm for long-read sequencing)

○ Type 5. Velvet: Used to know the coordinate and feature of the breakpoint junction sequence.

○ Type 6. SPAdes: Genome assembly

Assembler	De novo?	Parallelism	Support for paired-end reads?	Support for stranded reads?	Support for multiple insert sizes?	Outputs transcript counts?
G-Mo.R-Se	No	None	No	No	No	No
Cufflinks	No	MP	Yes	Yes	Yes	Yes
Scripture	No	None	Yes	Yes	Yes	Yes
ERANGE	No	None	Yes	Yes	Yes	Yes
Multiple-k	Yes	None	Yes	Yes	Yes	No
Rnnotator	Yes	MP	Yes	Yes	Yes	Yes
Trans-ABySS	Yes	MPI	Yes	Yes	No	Yes
Oases	Yes	MP	Yes	Yes	Yes	No
Trinity	Yes	MP	Yes	Yes	No	Yes

Table 3. Types of RNA-assembly

⑤ Process 5. Consensus polish

○ Type 1. Quiver

○ Type 2. nanopolish

⑥ Application 1. Reference matching

○ Definition: Process of determining which reference a read comes from when using multiple references of different species.

○ Special algorithms are used to correct errors due to inter-species homology.

○ Typically, multiple .bam files mapped to different references are used to find more accurate references for each transcript.

○ This discussion implies that to accurately and comprehensively map RNAs from a given tissue, one should consider one or more optimal combinations of references.

○ Type 1. Individual alignment

○ 1-1. Freemuxlet: Demultiplexing cells derived from multiple individuals using SNPs.

○ Type 2. Xenograft alignment: Matching references for each transcript using .bam files from graft and host references.

○ 2-1. XenofiltR: Takes .bam files from host and graft references. Generates .bam and .bam.bai files with only graft reads remaining.

○ 2-2. BAMCMP: Separates graft and host reads in xenograft data. Divides into graft-only, host-only, ambiguous, unmapped classes.

○ 2-3. disambiguate

○ 2-4. Xenomake : Solely for spatial transcriptomics.

○ Type 3. Using multiple references in pseudo-alignment (ref)

○ Simple approach of constructing integrated references like a single species’ reference results in lower accuracy.

○ Type 4. Microbiome Alignment

○ Bacterial References: SILVA, RDP (Ribosomal Database Project), Greengenes, RefSeq

○ Fungal References: UNITE, EUKARYOME

○ 4-1. BLAST: Slow but the most accurate

○ 4-2. VSEARCH: Rarely used nowadays

○ 4-3. MAFFT: For MSA (multiple sequence alignment)

○ 4-4. DECIPHER: For MSA (multiple sequence alignment)

○ 4-5. Pathseq: Performs filtering, alignment, and abundance estimation in mixed human-microbe metagenomics data using GATK (Genome Analysis Toolkit).

○ 4-6. Kraken

⑦ Application 2. Workflow Build

○ Type 1. SnakeMake : A tool for creating custom workflows independently.

○ Type 2. SpaceMake : Integrated workflow to process various spatial transcriptomics data.

⑷ Step 3. (optional) QC

① SAMtools: Possible to use SAMtools view. Filtering by flags is possible. (ref)

⑸ Step 4. sorting: Sorting BAM files along the alignment axis.

① Picard (release, trouble- shooting)

# How to install java on Ubuntu/Debian
sudo apt update
sudo apt install default-jdk

java -jar picard.jar SortSam \
      INPUT=input.bam \
      OUTPUT=sorted_output.bam \
      SORT_ORDER=coordinate

② SAMtools

# install in Linux
sudo apt update
sudo apt install samtools

# execution of samtools
samtools sort -o sorted_file.bam input_file.bam

③ name-sorted: Typically, sorting is done based on the read name.

④ coordinate-sorted: For example, the possorted_genome_bam.bam file created in SpaceRanger is position-sorted.

⑹ Step 5. marking: Marking duplicates arising from PCR amplification.

① Generally, DNA-seq involves removing duplicates, while RNA-seq does not: In RNA-seq, the same sequence may repeatedly appear not only due to technical duplicates but also due to high expressing transcripts or short genes. Removing these biological duplicates can reduce the dynamic range of the data or decrease statistical power.

② Picard MarkDuplicates

java -jar picard.jar MarkDuplicates INPUT=sorted_file.bam OUTPUT=marked_duplicates.bam METRICS_FILE=metrics.txt

③ SAMtools rmdup

samtools markdup -r -s sorted.bam marked_duplicates.bam

○ -s: single-end read. If paired-end read, remove this part.

④ Trimmomatic

⑤ Trim Galore!

⑥ fastp

⑺ Step 6. indexing : .bam.bai file is generated from .bam file.

① SAMtools

samtools index marked_duplicates.bam

⑻ Step 7. Count: Determining which feature (e.g., gene, exon) a read originated from when there is a read.

Figure 17. Counting process (from HTSeq)

① Algorithms determining whether a read came from gene A or gene B when there is a read.

② Type 1. HTSeq

○ A tool based on gene overlap.

○ Most commonly used counting algorithm.

○ Default is union.

○ Command example:

# install
pip install HTSeq

htseq-count -f bam -r pos -s no -i gene_id -t exon mouse_sample.bam Mus_musculus.GRCm38.99.gtf > output.count

○ -f bam: Specifies that the input file is a .bam file.

○ -r pos: Ensures that the feature order in the output file matches the order in the reference.

○ -s no: -s indicates whether the sequencing data is strand-specific. -s no specifies that the data is not strand-specific, meaning the reads’ transcriptional direction (strand) is not considered. For strand-specific data, you should use -s yes (forward strand) or -s reverse (reverse strand).

○ -i gene_id: Uses the gene_id attribute from the GTF file as the reference for each feature. transcript_id is also possible.

○ -t exon: Specifies feature type. Default is exon.

○ Mus_musculus.GRCm38.99.gtf: Reference name. Can use .gff file instead of .gtf.

○ > output.txt: Outputs count result to output.txt file.

○ Result example:

○ Simple counting algorithm without assembly process.

○ Issue 1. When there are multiple isoforms, shared reads are either consistently counted as +1 or discarded, leading to inaccurate counts.

○ Issue 2: Decreased accuracy when counting at the transcript level compared to the gene level

○ Gene count > Sum of isoform counts that constitute each gene ( ∵ ambiguous reads)

○ Using genome-aligned data for transcript alignment without separate assembly can be inaccurate

○ Due to high similarity between transcripts, distinguishing by sequencing technology can be extremely challenging

③ Type 2. RSEM

○ Utilizes the EM (expectation maximization) technique to accurately predict counts.

○ Isoform quantification is possible.

○ GDCPortal provides RSEM count values.

○ Step 1: RSEM installation

# step 1.
git clone https://github.com/deweylab/RSEM.git

# step 2. 
cd RSEM

# step 3.
make 

# step 4.
make install

○ Step 2: Prepare reference files

rsem-prepare-reference --gtf annotation.gtf genome.fa reference_name

○ Step 3: Expression estimation

# case 1. single-end
rsem-calculate-expression --single-end --alignments -p 4 aligned_reads.sam reference_name sample_output

# case 2. paired-end
rsem-calculate-expression --paired-end --alignments -p 4 aligned_reads.bam reference_name sample_output

④ Type 3. StringTie

# install
conda install -c bioconda stringtie

# count
stringtie sorted_bam.bam -o output.gtf -G reference.gtf -A gene_abundances.tsv

○ genome-based: (Comment) StringTie’s method for transcript-level abundance is not straightforward

○ Includes assembly process.

○ Takes sorted_bam.bam file from genome-aligned reads as input, outputs separate output.gtf file.

○ Uses novel network flow algorithm

○ Example results

⑤ Type 4. featureCounts (install)

./subread-2.0.6-Linux-x86_64/bin/featureCounts -a GRCh38.gtf -o counts.txt possorted_genome_bam.bam

○ Example results

⑥ Type 5. Cufflinks / Cuffdiff

⑦ Type 6. Tuxedo: Isoform quantification is possible.

○ Step 1. Cufflinks: Assembles transcript per sample and explains observed reads with minimum number of isoforms.

○ Step 2. Merge GFT files.

○ Step 3. Cuffquant: Conducts transcript quantification.

⑧ Type 7. Hisat2: Genome-based

⑨ Type 8. QoRTs

⑩ Type 9. eXpress

○ Utilizes the EM (expectation maximization) technique. Transcriptome-based.

○ Isoform quantification is possible.

⑪ Type 10. bowtie2: Transcriptome-based

⑫ Type 11. TIGAR2

○ Bayesian inference for transcript quantification.

○ Isoform quantification is possible.

⑬ Type 12. SALMON

○ Quasi-mapping, not base-to-base alignment + Isoform quantification

○ For gene-level analysis, tximport is used.

○ Step 1. ab initio or de novo assembly

○ Step 2. Construct a HASH table: Index the positions of distinct k-mers.

○ Step 3. Build a suffix array: Annotate the suffixes of k-mers.

○ Step 4. Quasi-mapping

○ 4-1. Scan the read from left to right: Scan until a k-mer is found in the hash table. Identify the suffixes of the detected k-mer.

○ 4-2. Identify the MMP (Minimum Mapping Position): Find the longest matching read sequence with exact matches to determine the MMP.

○ 4-3. Identify the NIP (Next In Position) assuming mismatches: Account for sequencing errors or natural variations in the read by skipping one k-mer to identify the NIP. Instead of stopping where the current match is disrupted, search for the next matching k-mer to increase the likelihood of mapping the entire read.

○ 4-4. Repeat steps 4-1 and 4-2 until the end of the read is reached to complete quasi-mapping.

○ Step 5. Quantify transcript abundance using the EM (Expectation-Maximization) algorithm.

⑭ Type 13. Ballgown (ref): Provides gene count, transcript count, DEG analysis results, etc.

○ Step 1: Prepare .ctab files

○ Method 1: TopHat2 + StringTie

○ Method 2: TopHat2 + Cufflinks + Tablemaker

○ StringTie command example: Generate .ctab files using the -B argument

stringtie -e -B -p 8 -G ref.gtf -l sample -o output.gtf aligned_reads.bam

○ Step 2: Check directory structure

ballgown/
├── sample1/
│   ├── e_data.ctab
│   ├── e2t.ctab
│   ├── i_data.ctab
│   ├── i2t.ctab
│   ├── t_data.ctab
│   └── output.gtf
├── sample2/
│   ├── e_data.ctab
│   ├── e2t.ctab
│   ├── i_data.ctab
│   ├── i2t.ctab
│   ├── t_data.ctab
│   └── output.gtf
├── sample3/
│   ├── e_data.ctab
│   ├── e2t.ctab
│   ├── i_data.ctab
│   ├── i2t.ctab
│   ├── t_data.ctab
│   └── output.gtf
...

○ Step 3: Execute Ballgown: Can be run with R

library(ballgown)
bg = ballgown(dataDir = "ballgown", samplePattern = "sample", meas = "all")
gene_expression = gexpr(bg)
transcript_expression = texpr(bg, 'all')

○ Example results

⑮ Type 14. Pathseq: Performs filtering, alignment, abundance estimation on mixed metagenomics data of human and microbes using GATK (Genome Analysis Toolkit).

⑯ Type 15. CellRanger

○ scRNA-seq alignment and count

⑰ Type 16. SpaceRanger

○ ST alignment and count

○ Limited to Visium. Uses STAR alignment and considers spatial barcoding additionally

⑼ Step 8. (optional) Error correction: It is possible to correct mapping errors during the assembly process.

① pbdagcon

② falcon_sense

③ nanocorrect

⑽ Step 9. (optional) Variant calling: Can investigate genetic variations like SNPs or indels. Variants are saved as a VCF file.

① Overview: A typical human genome differs from the reference human genome at 4.1-5 million sites

○ Majority (>99.9%) are SNPs and short indels.

○ Contains 2,100 to 2,500 structural variants affecting ~20 Mb of sequence and comprising (approximately):

○ 1000 large indels

○ 160 copy number variants

○ 915 Alu insertions, 128 L1 insertions, 51 SVA insertions

○ 4 numts

○ 10 inversions

○ Majority of variants in a single genome are common.

○ Only 40,000 to 200,000 per genome (1-4%) have a frequency <0.5%.

② Pipeline

○ Methods: Read pair mapping, read depth analysis, split read alignment, sequence assembly

Figure 18. Methods for variant calling

○ A BED file or VCF file is generated from a BAM file.

○ It is followed by the creation of a BEDgraph file (a graph data structure file composed of BED), a Wiggle file (a file comparing with the control group), and a bigWig file (a compressed binary version of the Wiggle file).

③ Types

○ GATK (Genome Analysis ToolKit): HaplotypeCaller in GATK is commonly used. Also, there are UnifiedGenotyper and Mutect2.

○ Freebayes

○ SAMtools mpileup

○ CaVEMan (Cancer Variants Through Expectation Maximization): Used for somatic substitution calling.

○ Pindel: Used for Indel calling.

○ BRASS (BReakpoint AnalySiS): Used for structural variant calling.

○ MACS, MACS2: Variant calling pipeline for ChIP-seq and ATAC-seq.

○ SPP

○ GWAVA (Genome-Wide Annotation of Variants)

○ DeepSea: Predicting noncoding variants

○ DanQ: Quantifying the functionality of DNA sequences using CNNs and RNNs

○ DeepFun: Predicting regulatory variants using CNNs

○ DeepC: Predicting 3D genome folding

○ Akita: Predicting 3D genome folding

○ Zinba-cat

○ Pepr

○ ANNOVAR: Variant annotation (coding / non-coding / chromatin mark)

○ PLINK

○ LD (Linkage Disequilibrium) Calculation

○ Genetic Relatedness Calculation: If genetic relatedness exists, genetic independence is compromised, leading to an overestimation of effective size or exaggeration of significance.

○ Supports various input and output formats (including VCF/dosage imputed data)

○ Commonly used formats:

○ .bed (a binary file storing the genotype matrix), .fam, .bim: All with the same prefix

○ .pgen (a binary file), .psam, .pvar

○ vcftools

○ Designed to handle VCF (Variant Call Format) files generated from sequencing projects.

○ Can output to PLINK format.

④ Application 1. Genotype clustering

○ GenCall (GenomeStudio from Illumina)

○ Between-sample model, one marker at a time

○ Assign genotypes to the nearest cluster

○ Can customize prior clusters

○ GenoSNP

○ Within-sample model, all markers in a sample

○ Using a variational Bayes EM model

○ Can work well for rare variant

○ Can be parallelized

○ optiCall

○ Mixture of between- and within-sample models

○ EM algorithm to fit mixture of t-distributions

○ zCall

○ Post-processing tool

○ Partitioning the X/Y coordinate according to the mean/variance of homozygote clusters for common variants, then recall the rare variants

⑾ Step 10. (optional) Post-hoc ** **transcript-to-gene conversion: When having count data for transcript IDs, it might be necessary to collapse multiple transcript IDs into one gene ID (e.g., for GO analysis)

① Example: GRCh38 (human) GFF file contains the following information

ID=exon-NR_046018.2-1;Parent=rna-NR_046018.2;Dbxref=GeneID:100287102,GenBank:NR_046018.2,HGNC:HGNC:37102;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2

② Post-conversion methods

○ Method 1. RefSeq

○ Method 2. UCSC knowngene

○ Method 3. Ensembl: manual

○ Method 4. GENCODE

○ Method 5. Useful functions available in R (for human, ref)

library(biomaRt)
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")  
ensembl_transcript_to_gene <- function(transcript_ids){
  # reference : https://support.bioconductor.org/p/106253/#106256

  res <- getBM(attributes = c('ensembl_transcript_id_version', 
                              'ensembl_gene_id', 
                              'external_transcript_name',
                              'external_gene_name'),
               filters = 'ensembl_transcript_id_version', 
               values = transcript_ids,
               mart = mart)

  if(dim(res)[1] == 0){
    return("")
  }	

  return(res[, 'external_gene_name'])
}

5. `QC 5.` Normalization

⑴ Overview

① Definition: Correction for biases where RNA reads do not accurately reflect gene expression due to technical limitations

② In essence, correcting systemic batch effects like library size

⑵ Type 1. library size normalization (Depth-based normalization)

① When comparing different samples, divide each sample by a normalization factor to adjust for the total RNA transcripts count

○ In other words, to compare specific gene expression between samples

○ Sequencing depth, a limitation specific to sequencing machines, can be referred to as library size

② 1-1. RPM (reads per million mapped reads) or CPM (counts per million mapped reads)

○ Divide each gene count by the total count of the sample and multiply by 10^6

○ R code: Maintains equal library size, doesn’t use methods like TMM

library(edgeR)
raw_counts_matrix = ?
group = ?
DEGL <- DGEList(counts=raw_counts_matrix, group=group)

# Keep the raw library size
DEGL_cpm <- calcNormFactors(DEGL, method = "none")

# Calculate the cpm
cpm <- cpm(DEGL_cpm, log = FALSE, normalized.lib.sizes=TRUE)

# Calculate the log cpm
log_cpm <- cpm(DEGL_cpm, log = TRUE, normalized.lib.sizes=TRUE)

#Check out the cpm normalized matrix and log cpm normalized matrix
head(cpm)
head(log_cpm)

③ 1-2. TMM (trimmed mean of M-values) (ref)

○ Definition: Normalize given read counts by dividing by the total read count of the sample + Trimming

○ Proposed by Robinson & Oshlack (2010) (ref)

○ Since trimming is performed, the library size does not remain constant.

○ Significance 1. Corrects the bias where read count does not accurately represent actual gene activity and is proportional to sequencing depth(ref)

○ Condition: Ensure genes with the same expression levels in both samples are not detected as DEGs

○ Assumption: Majority of genes are not differentially expressed (consistent with author’s experience)

○ Thought experiment

○ Sample A is a mixture of human + mouse RNA, Sample B is specifically human RNA from Sample A, both samples having equal human and mouse RNA counts

○ When depths are equal, human RNA reads in A would be exactly half of B’s human RNA reads: reads in A are distributed over genes twice as much

○ To adjust each gene’s RNA read count in A, multiply by a factor (normalization factor) twice that of B’s

○ This assumption leads to batch + sample effect ≃ batch effect, making the normalization method applicable only in this case

○ Significance 2. Determining total RNA production Sk is difficult, but calculating the ratio Sk / Sk’ for two samples is relatively easy

○ Brief definition: Divide given read count by the sample’s total count

○ Example

○ The following depicts differential gene expression of gene g between Sample 1 and Sample 2

○ 1^st. Symbol definitions

○ L_g: Length of gene g

○ μ_gk: Actual transcript count of gene g in sample k. Represents expression level. Related to population

○ N_k: Total transcript count in sample k

○ S_k: RNA count in sample k

○ S_k / N_k: Average RNA count per transcript

○ Y_gk: Observed transcript count of gene g in sample k. Related to the sample population

○ M_g: Gene-wise log-fold-change

○ A_g: Absolute expression level

○ S_k / S_r: Scaling factor to divide by in sample k. Proportional to S_k. Refer to the thought experiment above.

○ TMM: Normalization factor

○ 2^nd. Remove genes with expression of 0

○ 3^rd. Trimming: The aspect that is most fundamentally different from the RPM or CPM.

○ Trimmed mean: Average of data excluding top x% and bottom x%

○ Doubly trimmed: Trim based on log-fold-change Mrgk and absolute intensity Ag

○ Initial researchers trimmed 30% each way by Mrgk and 5% each way by Ag

○ 4^th. Divide by TMM for sample k

○ w^r_gk: Larger weights for genes with low expression to prevent distortion

○ When N_k = N_k’ and Y_gk = 2 × Y_gr, TMM is approximately 2

○ R code: Change library size with TMM and then calculate CPM

library(edgeR)
raw_counts_matrix = ?
group = ?
DEGL <- DGEList(counts=raw_counts_matrix, group=group)

# Calculate normalization factors using TMM method to align columns of a count matrix
DEGL_TMM <- calcNormFactors(DEGL, method="TMM")

# Calculate the cpm with the TMM normalized library
TMM <- cpm(DEGL_TMM, log = FALSE, normalized.lib.sizes=TRUE

# Check out the cpm of TMM normalization
head(TMM)

○ Application 1. GeTMM method

④ 1-3. RLE (relative log estimate)

○ Proposed by Anders & Huber (2010).

○ The normalization method adopted by default in the R packages DESeq and DESeq2.

○ Step 1. Creation of a pseudo-reference (median library): The geometric mean is taken across all samples.

○ X: raw count

○ g: gene

○ k: condition

○ r: replicate

○ Step 2. The median ratio of each sample to the pseudo-reference is used as the scale factor (size factor).

○ Step 3. The gene count value of that sample divided by the scale factor is considered the normalized count for that gene.

○ R code

library(edgeR)
raw_counts_matrix = ?
group = ?
DEGL <- DGEList(counts=raw_counts_matrix, group=group)

# Calculate normalization factors using RLE method to align columns of a count matrix
DEGL_RLE <- calcNormFactors(DEGL, method="RLE")

# Calculate the cpm with the RLE normalized library
RLE <- cpm(DEGL_RLE, log = FALSE, normalized.lib.sizes=TRUE)

# Check out the TMM normalized result
head(RLE)

⑤ 1-4. UQ (Upper Quartile) Normalization

○ A method to adjust expression levels at specific quantiles to be the same when analyzing two transcriptomes.

○ Proposed by Bullard et al. (2010).

○ Generally uses the upper 75% (lower 25%) quartile values as variables (cf. Q3-norm).

○ Mostly used in microarray data.

○ R Code

library(edgeR)
raw_counts_matrix = ?
group = ?
DEGL <- DGEList(counts=raw_counts_matrix, group=group)

# Calculate normalization factors using UQ method to align columns of a count matrix
DEGL_UQ <- calcNormFactors(DEGL, method="upperquartile")

# Calculate the cpm with the UQ normalized library
UQ <- cpm(DEGL_UQ, log = FALSE, normalized.lib.sizes=TRUE)
# Check out the UQ normalized result
head(UQ)

⑥ 1-5. SpaNorm (GitHub)

○ Library size normalization considering spatial region-specific library size bias

○ Negative binomial distribution-based

⑶ Type 2. Gene Length Normalization

① Definition: A method of correcting for gene length when comparing the expression of different genes (or exons, isoforms) within a single sample.

○ Here, gene length refers to the effective length, which is the actual gene length minus the read length.

○ Gene length correction: Gene counts are proportional to gene length, so count values are divided by gene length.

○ Gene length normalization is criticized for not properly reflecting data characteristics, with decreasing importance of metrics like FPKM, TPM.

② 2-1. RPKM (Reads Per Kilobase of Transcript per Million Mapped Reads)

○ Formula: RPKM for gene i is represented as Q reads and ℓ exon length.

○ Used to compare gene expression among different genes within a sample.

○ Example: 25,000 reads in gene / (0.5 kb gene × 40 million reads) = 1,250

③ 2-2. FPKM (Fragments Per Kilobase of Exon per Million Mapped Fragments)

○ Formula: FPKM for gene i is represented as q reads and ℓ exon length.

○ Fragment refers to a pair of reads in paired-end sequencing.

○ Similar to RPKM, but used only for paired-end RNA-seq (Trapnell et al., 2010).

○ Example: 25,000 paired end fragments in gene / (0.5 kb gene × 40 million paired end reads) = 1,250

④ 2-3. TPM (Transcripts Per Kilobase Million): Proposed by Li et al. (2010)

○ Definition: Normalized based on how many molecule from this gene for each 1M RNA molecules in the sample.

○ The only difference from RPKM: When calculating TPM, you normalize for gene length first, and then normalize for sequencing depth secondly.

○ Relatively easy to use due to length-based normalization, more common than CPM.

○ Summing TPM values in a single library gives 1,000,000, allowing comparison between different samples.

○ Relation between TPM and RPKM: For the number of genes n,

○ TPM is highly correlated with RPKM (or FPKM) results

⑷ Type 3. Log-Transformation

① Purpose 1: For genes with high count values, the actual activity of those genes may be exaggerated in interpretation.

② Purpose 2: When there are both very small and very large values, it is necessary to adjust the scale to be similar.

③ To address the above issues, the results of log-transforming the normalized counts are considered as gene expression values.

④ There are various transformations such as power transformation, but log transformation is the state-of-the-art (SOTA) (ref).

⑸ Type 4. scaling (e.g., z-score transformation)

① For each patient and each sample, the range of gene expression values varies. To compare them rationally, the range of values is adjusted.

② For example: In the case of the TCGA, Seurat pipeline, the maximum gene expression value is set to around 10 → the minimum value generally becomes negative.

⑹ Type 5. Feature selection, Cell selection

⑺ Type 6. Lambda GC: SNP Data Normalization Method

① Step 1. Analyze the correlation between each SNP and the phenotype.

② Step 2. Convert the p-value to a z-score. For example, in R: z = qnorm(p / 2)

③ Step 3. Calculate the chi-square statistic. For example, in R: c = z^2

④ Step 4. Calculate the median of the chi-square statistics for all SNPs.

⑤ Step 5. Lambda GC: Divide the above median by 0.455 (which is the median of χ² with df = 1). The expected value is 1.

⑥ Step 6. Divide the actual χ² value for each SNP by lambda GC, and then reverse to calculate the p-value.

⑻ R: Obtaining normalized expression using the Seurat package.

library(dplyr)
library(Seurat)

pbmc.data <- Read10X(data.dir = "./Spatial_matrix/")
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 100)
pbmc[["percent.mt"]] <- PercentageFeatureSet(object = pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, subset = nFeature_RNA > 500)
pbmc <- NormalizeData(object = pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
pbmc <- FindVariableFeatures(object = pbmc, selection.method = "vst", nfeatures = 2000) 

### Download for Normalized Count Data
write.csv(pbmc@assays$RNA@data, "./normalized_counts.csv")


all.genes <- rownames(x = pbmc) 
pbmc <- ScaleData(object = pbmc, features = all.genes) 

### Download for Scaled Expression Data (also known as TCGA scale) 
write.csv(pbmc@assays$RNA@scale.data, "./scaled_expression.csv")

⑼ Python: Obtaining normalized expression using the scanpy package (ref)

import scanpy as sc
import pandas as pd

tissue_dir = './'

adata = sc.read_visium(tissue_dir) 
adata.var_names_make_unique()
sc.pp.normalize_total(adata, inplace=True)
sc.pp.log1p(adata)

pd.DataFrame(adata.X).to_csv('normalized_expression.csv')

⑽ Paper Representation

○ The count data were normalized by log2-transformation of counts per million (CPM) + 1 pseudocount, and scatter plots were generated for each pair of consecutive section using the ggplot2 package (28) in R Studio (versioin 1.1.453). The built-in stats package was used to compute Pearson correlations. (ref)

6. `QC 6.` Batch Effect

⑴ Overview

① Background: Apart from experimental variables, batch effects can influence results between experimental and control groups.

② Batch Effect: Experimental outcomes affected by factors other than biological variables. Examples include:

○ Sequencing dates

○ Researchers performing sequencing

○ Sequencing equipment

○ Protocols

③ Systemic batch effects like library size can be corrected through normalization.

④ Batch effect removal process referred to as removing non-biological batch effects, often using regression analysis.

⑵ Batch Effect Removal

① This is particularly important in DEG analysis through comparison between samples.

② Caution

    batch condition
    <factor> <factor>
        1        A 
        1        A
        1        B
        1        B
        2        C
        2        C

○ Batch effects cannot be theoretically removed if a specific condition applies to a specific batch.

○ However, even in this case, it is possible to create an incomplete regression model by distinguishing between batch effect and covariate to remove the batch effect.

③ Method 1: limma::removeBatchEffect: Removes batch effects by inputting normalized expression matrix and batch information into a linear model. (ref)

④ Method 2: sva::ComBat (Johnson et al., 2007): Empirical Bayes-based method for removing batch effects when batch information is known. (ref, ref, ref). Uses linear models and empirical Bayes shrinkage.

# Import the sva package
library(sva)

# Create batch vector
batch <- c(1,2,3,1,2,3)

# Apply parametric empirical Bayes frameworks adjustment to remove the batch effects
combat_edate_par = ComBat(dat=TMM, batch=batch, mod=NULL, par.prior=TRUE, prior.plots=TRUE)

# Apply non-parametric empirical Bayes frameworks adjustment to remove the batch effects
combat_edata_non_par = ComBat(dat= TMM, batch=batch, mod=NULL, par.prior=FALSE, mean.only=TRUE)

# Check out the adjusted expression profiles
head(combat_edate_par)
head(combat_edate_non_par)

⑤ Method 3: ComBat_seq (Zhang et al., 2020): Takes raw count matrix and outputs adjusted count matrix. Batch information should be known.

library(sva)

# Create batch vector
batch <- c(1,2,3,1,2,3)

# Include group condition
combat_seq_with_group <- ComBat_seq(raw_counts_matrix, batch=batch, group=group, full_mod=TRUE)
# Without group condition
combat_seq_without_group <- ComBat_seq(raw_counts_matrix, batch=batch, group=NULL, full_mod=FALSE)

# Check out the adjusted expression profiles
head(combat_seq_with_group)
head(combat_seq_without_group)

⑥ Method 4: SVA seq (Leek, 2014): Functions even without known batch information.

⑦ Method 5: RUV seq (Risso et al., 2014)

⑶ Measurement of the accuracy of batch effect removal

① k-nearest-neighbor batch effect test (kBET)

② ASW across batches

③ k-nearest-neighbor (kNN) graph connectivity

④ batch removal using PCA regression

⑷ Application 1: Clustering and Batch Effect

① Clustering can be used for batch effect removal by dividing experimental and control groups within major clusters (e.g., cell type clusters) to obtain DEGs.

⑸ Application 2: Data Integration

① Merging data from different batches or modalities, with consideration of batch correction effects.

② Modifying two groups to have the same characteristics.

○ Example 1: Train the model so that the clustering patterns are not affected by batch effects.

○ Example 2: Train the model to minimize the discriminator’s performance, which distinguishes data based on batch-specific effects (i.e., to make the data indistinguishable after integration).

Figure 19. Types of Data Integration

③ Method 1. MNN (Mutual Nearest Neighbors) (git): Utilized in R scater or Python scanpy.

④ Method 2. Seurat V1 and V2: Uses CCA (Canonical Correlation Analysis)

○ Has a single data slot for expression values that stores either raw counts or normalized data.

○ Designed primarily for single modalities.

○ If the batch effect is large or there are few common cell subsets, integration is not effective.

○ CCA-based Data Integration: This method aggressively aligns the two groups.

○ RPCA-based Data Integration: This method performs a less aggressive alignment of the two groups, making it useful when the tissue characteristics of the two datasets are significantly different.

⑤ Method 3. Seurat V3

○ Step 1. Perform dimensional reduction on both the reference and query datasets using diagonalized CCA.

○ Step 2. Apply L2-norm to the canonical correlation vector.

○ Step 3. Compute MNN (Mutual Nearest Neighbors) to explore the low-dimensional representation shared by both datasets: This involves identifying and linking cells that have the same context within the dataset. These linked cell pairs are referred to as anchors.

○ Step 4. Score the anchors and remove incorrectly linked low-confidence anchors: The score is defined as shared neighbor overlap.

○ Step 5. Define the weighted distance: Given the i-th anchor a_i, query cell c, and anchor score S_i,

○ Step 6. Apply a Gaussian kernel to the weighted distance: The default standard deviation (sd) is 1.

○ Step 7. Normalize all k.weighted anchors.

○ Step 8. Compute the integration matrix B = Y[:, a] - X[:, a] for all anchor cell pairs, where X is the reference expression matrix and Y is the query expression matrix.

○ Step 9. Compute the transformation matrix C using the weight matrix W and the integration matrix B.

○ Step 10. Subtract the transformation matrix C from the original query expression matrix Y to obtain the integrated expression matrix.

○ Step 11. Label transfer: Compute the label prediction P_l using the binarized anchor classification matrix L and the weight matrix W.

⑥ Method 4. Seurat V5

○ Has fast and more efficient anchoring with improved memory handling.

○ Allows for integration of not just modalities measuring nucleic acids but also others like proteins.

⑦ Method 5. BBKNN (Batch-Balanced k-Nearest Neighbor): Using PCA

⑧ Method 6. mnnCorrect

⑨ Method 7. Scanorama: Used in Python scanorama. Using singular value decomposition

⑩ Method 8. Conos : Used in R conos.

⑪ Method 9. scArches: Utilizes transfer learning.

⑫ Method 10. DESC

⑬ Method 11. fastMNN (batchelor): Using PCA

⑭ Method 12. Harmony: Using PCA. Harmony does not provide batch effect-corrected normalized expression, but it does provide batch effect-corrected low-dimensional embeddings that can be used for downstream analysis.

⑮ Method 13. LIGER: A general approach for integrating single-cell transcriptomic, epigenomic and spatial transcriptomic data. factor-based. Using integrative non-negative matrix factorization

⑯ Method 14. SAUCIE

⑰ Method 15. scANVI: conditional VAE (ref)

⑱ Method 16. scGen: conditional VAE (ref)

⑲ Method 17. scVI: conditional VAE (ref)

○ Pros: Provides batch-effect-corrected normalized expression and includes a reference code for identifying differentially expressed genes (DEGs). (ref)

○ Cons 1. Not applicable to epigenetics data (e.g., scATAC-seq).

○ Cons 2. The separately provided DEG identification code is slow and inaccurate. The DEG results can sometimes be completely reversed due to slight differences in batch-effect correction between all 10 datasets and only 2 of them.

○ Recommendation: It is recommended to use scVI normalized expression and scanpy.tl.rank_genes_groups for DEG identification.

⑳ Method 18. TrVae: conditional VAE (ref)

㉑ Method 19. TrVaep

㉒ Method 20. scIB: In addition to merging multiple integration tools, benchmarking is also provided.

○ Integration function: BBKNN, ComBat, DESC, Harmony, MNN, SAUCIE, Scanorama, scANVI, scGen, scVI, trVAE

○ Benchmarking metric: ARI, ASW, F1, mutual score, etc.

㉓ Method 21. iMAP

㉔ Method 22. INSCT

㉕ Method 23. scDML

㉖ Method 24. scDREAMER, scDREAMER-Sup: It even enables inter-species integration.

㉗ Method 25. SATURN: Integration of datasets from different species

㉘ Method 26. GLUE (Cao and Gao, 2023), SpatialGlue: A tool for integrating unpaired data from different platforms such as scRNA-seq, scATAC-seq, and snmC-seq.

○ Step 1. Embed cell × feature data from each omics into cell × embedding, where each cell is represented in a common dimension.

○ Step 2. To represent the relationships between omics, use a knowledge-based guidance graph to represent regulome × embedding, where each regulome is expressed as a vector in the same dimension as in Step 1.

○ Step 3. Construct an autoencoder by feeding the inputs from Step 1 and Step 2 into the encoder, with the decoder outputting the integrated data.

○ Step 4. Adversarial learning: Introduce a discriminator to identify platform-specific effects, and search for autoencoder parameters that minimize the discriminator’s performance (thus facilitating better integration).

㉙ Method 27. Seurat Anchor (Stuart et al., 2019): factor-based

㉚ Method 28. DC3 (Zeng et al., 2019): factor-based

㉛ Method 29. coupled NMF (Duren et al., 2018): factor-based

㉛ Method 30. SCOT (Demetci et al., 2022): topology-based

㉛ Method 31. UnionCom (Cao et al., 2020): topology-based

㉛ Method 32. Panoma (Cao et al., 2022): topology-based

㉛ Method 33. MrVI

㉛ Method 34. scGCN: Supports multi-omics

㉛ Method 35. scETM

㉛ Method 36. MultiVI

㉛ Method 37. Biolord

⑹ 2-1. Merge between scRNA-seq or ST

① Seurat: Except for clustering, most methods require switching DefaultAssay from integrated to RNA/Spatial/SCT, etc

○ FindMarkers

○ FindAllMarkers

○ FeaturePlot

○ SpatialFeaturePlot

○ DotPlot

○ DimPlot

○ SpatialDimPlot

○ genewise correlation

○ trajectory analysis

② BANKSY: Enables data integration between spatial multiomics

⑺ 2-2. Merging scRNA-seq data with other modalities

Figure 20. Merging scRNA-seq data

Table 4. Analysis methods for matched data

Table 5. Analysis methods for unmatched data

7. `Common 1.` Clustering

⑴ Type 1: K Means Clustering

⑵ Type 2: Unsupervised Hierarchical Clustering

⑶ Type 3: Matrix factorization

① Algorithm to factorize a known matrix A into W and H matrices: A ~ W × H

○ A matrix: Represents samples and features. Can be inferred from samples.

○ H matrix: Represents variables and features.

○ Similar to K means clustering, PCA algorithm.

○ An autoencoder is a broader concept than matrix factorization because it includes non-linear transformations.

○ The following algorithms are based on the least square method, but can also utilize methods such as gradient descent.

② Algorithm: R = UV, where R ∈ ℝ^5×4, U ∈ ℝ^5×2 , and V ∈ ℝ^2×4, searches for U and V.

import numpy as np
R = np.outer([3,1,4,2.,3],[1,1,0,1]) + np.outer([1,3,2,1,2],[1,1,4,1])
U=np.random.rand(5,2)
V=np.random.rand(2,4)
for i in range(3):
    V,_,_,_=np.linalg.lstsq(U, R, rcond=None)
    Ut,_,_,_=np.linalg.lstsq(V.T, R.T, rcond=None)
    U=Ut.T
U@V # it is similar to R!

③ NMF(non-negative matrix factorization)

import numpy as np
R = np.outer([3,1,4,2.,3],[1,1,0,1]) + np.outer([1,3,2,1,2],[1,1,4,1])
U=np.random.rand(5,2)
V=np.random.rand(2,4)
for i in range(20):
    V,_,_,_=np.linalg.lstsq(U, R, rcond=None)
    V = np.where(V >= 0, V, 0) #set negative entries equal zero
    Ut,_,_,_=np.linalg.lstsq(V.T, R.T, rcond=None)
    U=Ut.T
    U = np.where(U >= 0, U, 0) #set negative entries equal zero
U@V # it is similar to R!

④ Matrix completion (Netflix algorithm): performing matrix factorization on a masked R.

import numpy as np
R = np.outer([3,1,4,2.,3],[1,1,0,1]) + np.outer([1,3,2,1,2],[1,1,4,1])
mask=np.array([[1.,1,1,1],
               [1,0,0,0],
               [1,0,0,0],
               [1,0,0,0],
               [1,0,0,0]])
               
U=np.random.rand(5,2)
V=np.random.rand(2,4)
RR = U@V
RR = RR*(1-mask)+R*mask

for i in range(20):
    V,_,_,_=np.linalg.lstsq(U, R, rcond=None)
    V = np.where(V >= 0, V, 0)
    Ut,_,_,_=np.linalg.lstsq(V.T, R.T, rcond=None)
    U=Ut.T
    U = np.where(U >= 0, U, 0)
    RR = U@V
    RR = RR*(1-mask)+R*mask

RR*(1-mask)+R*mask # it is similar to R!

⑤ Application 1. Cell type classification

○ Aims to obtain cell type proportions from scRNA-seq data from tissues.

○ Important to reduce confounding effects due to cell type heterogeneity.

○ 3-1. Constrained linear regression

○ 3-2. Reference-based approach

○ 3-2-1. CIBERSORT (cell-type identification by estimating relative subsets of RNA transcript): Allows you to check the cell type proportion and p-value for each sample. Nu-Support Vector Regression (NuSVR) is the core algorithm of CIBERSORT.

○ 3-2-2. EPIC

⑥ Application 2. joint NMF: Expands to multi-omics.

⑦ Application 3: Metagene extraction

⑧ Application 4. Starfysh

○ The following algorithm infers archetypes from spatial transcriptomic data and determines anchors representing each archetype.

○ It enables reference-free inference of cell type or cell state DEGs and allows estimation of cell type distribution using H\&E information.

○ Step 1. Construct an autoencoder

○ X ∈ ℝ^S×G: Input data (spots × genes)

○ D: Number of archetypes

○ B ∈ ℝ^D×S: Encoder. In the context of inferring archetypes, the sum of the distribution of each archetype across all spots must be 1.

○ H = BX: Latent variable

○ W ∈ ℝ^S×D: Decoder. In the context of reconstructing the input data, the sum of the weights of all archetypes for each spot must be 1.

○ Y = WBX: Reconstructed input

○ Step 2. Solve the optimization algorithm to calculate W and B

○ Step 3. Spots with the highest weights for each archetype in the W matrix are selected as anchor spots

○ Step 4. Adjust granularity: If the distance between archetypes is close, merge them or use a hierarchical structure to adjust the distance

○ Step 5. Form archetypal communities by searching for the nearest spot for each anchor and identifying marker genes

○ Step 6. If a signature gene set is given, add archetypal marker genes to the existing gene set and recalculate the anchors.

○ During this, use the stable marriage matching algorithm to match each archetype with the most similar signature.

⑷ Category 4. Other clustering algorithms

① SNN (shared nearest neighbor) modularity optimization based clustering algorithm

② Leiden clustering

③ Louvain clustering

④ mean-shift clustering

⑤ DBSCAN (density-based spatial clustering of applications with noise)

⑥ spectral clustering

⑦ gaussian mixture

⑧ watershed algorithm

⑨ thresholding method

⑩ MST (minimum spanning tree)

⑪ curve evolution

⑫ sparse neighboring graph

⑬ SC3

⑭ SIMLR

⑮ FICT

⑯ fuzzy clustering

8. `Common 2.` Differentially Expressed Gene (DEG) Analysis

⑴ Definition: Process of finding genes that differ between experimental and control groups.

⑵ Criteria for DEG: Slightly varies depending on experimental design.

① Experimental design

Figure 21. Design matrix

① FC (fold change)

○ Definition: Ratio of treatment’s average gene expression to control’s average gene expression.

○ Issue 1. Division by zero: Genes with such issues are either removed beforehand or assigned special values through separate interpretations.

○ Issue 2. Asymmetrical values around 1: Addressed by using log fold change.

② Adjusted p-value

○ Multiple Testing Problem: The act of conducting multiple statistical tests itself can lead to inaccurate conclusions.

○ Type 1. Control of Family-Wise Error Rate (FWER)

○ Definition: The probability of reaching at least one incorrect conclusion among all hypotheses tested. For instance, a FWER of 5% means there is a less than 5% chance of making even one incorrect conclusion across multiple hypothesis tests. This approach is very conservative, allowing almost no false positives.

○ 1-1. Sidak Correction: Adjusts the alpha threshold rather than the p-values and is used when p-values are independent.

○ 1-2. Bonferroni Correction: Directly adjusts each p-value and can be applied even when p-values are not independent. It is highly conservative.

○ Type 2. Control of False Discovery Rate (FDR)

○ Definition: The method of controlling the probability that a null hypothesis is included among the actually rejected hypotheses (FDR, the rate of Type I errors) to be below a certain level.

○ 2-1. Benjamini–Hochberg (B&H): Used when the correlations between tests are relatively simple.

○ 2-2. Benjamini–Yekutieli (B&Y): Used when the correlations between tests are complex.

○ Type 3. FWER and FDR control can be applied simultaneously.

③ Visualization 1. Volcano plot

○ x-axis: log fold change

○ y-axis: log adjusted p-value

○ Useful for visualizing the distribution of genes satisfying the DEG condition.

④ Visualization 2. MA plot

○ x-axis: Mean of log normalized count

○ y-axis: log fold change

Widely used in microarray analysis and applicable to RNA-seq.

⑶ Statistical Techniques

① t test

○ One of the commonly used statistical techniques

○ Parametric statistical estimation

○ Applies t-test after taking the log of expression (FPKM/TPM)

○ Not recommended for small sample sizes due to difficult variance estimation: In this case, DESeq, EdgeR, limma are recommended

○ Strict application of FC threshold is recommended

② Mann-Whitney-Wilcoxon (Wilcoxon rank-sum test)

○ Non-parametric statistical estimation

○ Not recommended for sample sizes less than 10

○ For small samples, it is recommended to use limma, edgeR, DESeq2, etc.

③ ANOVA & Kruskal-Wallis test

○ Used for multi-level single factor DEG analysis

○ For sample sizes less than 10, it’s recommended to use DESeq2, edgeR, limma’s multi-level single factor mode

④ FC cutoff

○ Fold change cutoff implicitly assumes a constant variange, e.g. from prior.

⑷ DEG Tools

① DESeq, DESeq2 (paper, manual)

○ Input: raw count

○ Assumes that we can obtain better variance estimates of genes without assuming that they all have the same variance.

○ Does not support random effect modeling or mixed effect modeling: limma is somewhat flexible, so it can be done.

○ Based on the negative binomial model

○ The transcriptome has a large variance due to many zero counts, making it more preferred over the normal or Poisson distribution.

○ The negative binomial model can be sensitive to outliers: Newest version of DESeq2 add extra features to be robust to outliers.

○ Estimate variance using the empirical Bayes model under the following two options:

○ Option 1. dispersion

○ Step 1. Perform maximum likelihood estimation (MLE) for each gene to obtain the black data points in the graph.

○ Step 2. Derive the red trend line from the black points, which serves as the prior mean.

○ Step 3. Compute the MAP (maximum a posteriori) estimate, indicated by the blue arrows.

○ Step 4. Data points that are not shrunk towards the prior can be considered outliers, marked with blue circles.

Figure 22. DESeq dispersion mode

○ Option 2. fold change (optional)

○ DESeq2 provides shrinkage for log2FC estimate (older version set this as default): lfcShrink()

○ LFC for all genes are first used as prior to estimate the trend, and “shrink” the LFC likelihood estimate.

○ Example: In the following figure, you can see that the green data points, which have a high likelihood for the data, are not well shrunk towards the prior.

Figure 23. DESeq fold change mode

Black line: prior Solid line: unshrunken estimate Dotted line: shrunken LFC estimate

○ Flags outlier for each gene using Cook’s distance: The distance is a metric to measure the influence of a data point in least-square analysis.

○ Small sample size: Remove gene from analysis.

○ Large sample size: If there are samples detected as outliers for a specific gene, exclude those samples from the analysis.

○ Note the different preprocessing methods for obtaining DEGs and visualization.

② edgeR

○ edgeR (exact), edgeR (GLM)

○ Uses TMM normalization and negative binomial generalized linear model (GLM).

○ edgeR by default filters out genes with <5 reads total to estimate a common dispersion value.

○ Several options to estimate dispersion

○ Common: Uses single, common dispersion estimate. Not recommended.

○ Trended: Estimate from the trend function.

○ Tagwise: Bayesian moderated approach, similar to DESeq2.

③ edgeR-QLF

○ Models variance as Var(y_gi) = σ_g²(μ_gi + μ_gi²φ) to allow for over-dispersion.

○ Dispersion φ: A function of gene abundance (φ(A)), and it fits dispersion for the trend estimate for the empirical mean-variance relationship.

○ σ_g² is a gene-specific variance (quasi-dispersion parameter). This is fit with an empirical Bayes method. A trend funciton is fit between the gene abundance and the raw QL dispersion estimates. The raw QL estimate (likelihood) then shrink towards the mean fitted trend for that abundance using the empirical Bayes approach.

○ μ_gi + μ_gi²φ (i.e. σ_g² = 1) corresponds to the variance of a standard negative binomial.

④ limma

○ limma + voom (paper, manual)

○ Input: raw count (avoid using normalized data like FPKM with voom)

○ 1^st. Convert read count to log CPM.

○ 2^nd. Use of empirical Bayes method (eBayes(); assumes normal distribution): Removes the correlation between mean and variance present in count data.

○ 3^rd. DEG exploration based on Generalized Linear Model (GLM).

○ The matrix used for obtaining DEGs is also used for visualization

○ Assumes that we can obtain better variance estimates of genes without assuming that they all have the same variance.

○ Generally is more conservative than edgeR and DESeq2.

○ limma FPKM (paper, manual)

○ Input: FPKM (useful when raw count data is not available)

○ 1^st. Convert FPKM to log2 scale

○ 2^nd. Run limma’s eBayes() function with the trend = TRUE option

○ This method is similar to what’s used in microarray analysis and is akin to the limma-trend method

⑤ Sseq

⑥ CuffDiff: Uses FPKM values.

⑦ BaySeq

⑧ DEGSeq

⑨ NOISeq

⑩ PoissonSeq

⑪ SAMSeq

⑫ scVI

○ Provides normalized expressions that remove batch effects and provides code to calculate DEGs based on them (ref).

○ The provided code for calculating DEGs (Differentially Expressed Genes) is not only slow but also inaccurate: removing batch effects from 10 datasets might slightly differ from the removal in just 2 of those datasets, which can sometimes lead to completely opposite DEG patterns.

○ It is recommended to use scVI normalized expression and scanpy.tl.rank_genes_groups to identify DEGs.

⑬ Clustering-free DEG algorithm

○ Distribution-test-based: SingleCellHayStack, MarcoPolo (on/off method + Poisson distribution assumption), SEMITONES, SchayStack

○ Differential-statistics-based: DUBStepR, M3Drop, Cofea, BigSur

○ TAMER: Mutual exclusivity (≠ anti-correlation) is employed.

⑸ Selection of techniques

Sample size	Count depth
	Low counts (~20 M or less)	High counts (~30 M+)
Small (3-9)	Bayesian count-based test (e.g., edgeR QLF, DESeq2)	Bayesian method (count-based or continuous)
Medium (10-30)	Bayesian count-based test (e.g., edgeR QLF, DESeq2)	Count-based or continuous; possibly non-parametric
Large (>30)	Count-based test	Many options: count-based, continuous, non-parametric

Table 6. Selection of techniques

9. `Common 3.` Gene Set Enrichment

⑴ Overview

① Definition

○ Analyzing gene sets as a whole, rather than individual genes.

○ Gene Score: Refers to the value generated by a list of genes.

○ Signature: A dataframe composed of gene names, FC (fold change) values, or p-values. Gene score can also be considered a signature.

② Method

○ Method 1. ORA (Over-Representation Analysis)

○ Tests whether a specific pathway contains more DEGs than would be expected by chance.

○ Classified as a competitive test rather than a self-contained test.

○ Self-contained test: Evaluates differential expression or specific patterns using only the gene set of interest.

○ Competitive test: Compares the gene set of interest against the entire background set of genes to determine whether it contains a significantly higher number of genes associated with a particular pathway.

○ 1-1. Fisher’s exact test (hypergeometric test)

Table 7. Contingency table in gene set analysis

○ Statistic 1. Probability: Probability of being like the sample.

○ Statistic 2. Odds ratio: Shows if GO and Gene Set are similar or dissimilar.

○ Statistic 3. Gene ratio: Represents A / (A+B), i.e., the ratio of common genes to the input gene set.

○ Statistic 4. Count: Usually represents A, the number of intersection elements between two sets.

○ An example of overrepresentation analysis.

○ 1-2: Modified Fisher’s exact test (e.g., DAVID)

Figure 24. Modified Fisher’s exact test

○ When the number of DEGs is small, the impact is significantly reduced, whereas when the number is large, the impact is minimal.

○ An example of overrepresentation analysis.

○ Method 2. Gene set ranking (scoring) method

○ Tests each pathway using ranks or significance values across genes.

○ Example: GSEA, singscore

○ Method 3. Network-based method

③ Interpretation

○ Rank genes based on expression between experimental and control groups, then calculate enrichment scores.

○ If genes from the query gene set are highly concentrated at both extremes (e.g., top 10 or bottom 10) of the ranked gene list, the enrichment score increases.

⑵ Type 1. GSEA (gene set enrichment analysis) (manual)

① Overview

○ Non-parametric permutation-based.

○ Widely used, especially with human and cancer-related data.

② Input

○ Type 1. expression file (.gct or .txt): Expression values

○ Type 2. phenotype label (.cls): Group names (e.g., control, normal)

○ Type 3. gene annotation (.chip): Gene names

○ Type 4. gene set data: Gene sets

○ Category 1. M1H: Mouse-ortholog hallmark gene sets, etc.

○ Category 2. M1: Positional gene sets, etc.

○ Category 3. M2: Curated gene sets, etc.

○ Category 4. M3: Regulatory target sets, etc.

○ Category 5. M5: Ontology gene sets, etc.

○ Category 6. M8: Cell type signature gene sets, etc.

③ Implementation

○ GSEA_R: R package. Increase # of iterations to 10,000. Better for large sample sized experiments.

○ Some “gsea” implementations in R (e.g., fgsea) are different.

④ Output

○ FDR (false discovery rate)

○ NES (normalized enrichment score)

○ leading edge subset

⑤ Python code

def GSEA(gene_set, expression):
    """Perform GSEA testing on a gene set and expression data set

    Args:
        gene_set (list): list of gene names in a gene set
        expression (dict): dictionary with keys of gene names and values of expression

    Returns:
        enrichment (float): enrichment score for a given gene set
        scores (list): list of running enrichment scores as genes are processed
    """
    
    # Order genes in a descending orer according to expression 
    ranked_genes = sorted(expression.items(), key=lambda x: x[1], reverse=True)
    ranked_gene_names = [gene for gene, _ in ranked_genes]
    
    N = len(ranked_gene_names)
    Nh = len(gene_set)
    Nm = N - Nh
    
    if Nh == 0 or Nm == 0:
        raise ValueError("Gene set is empty or covers the entire gene universe.")
    
    hit_score = 1.0 / Nh
    miss_score = 1.0 / Nm
    
    running_score = 0.0
    scores = []

    for gene in ranked_gene_names:
        if gene in gene_set:
            running_score += hit_score
        else:
            running_score -= miss_score
        scores.append(running_score)
    
    # Enrichment score: maximum deviation from zero
    enrichment = max(scores)
    
    return enrichment, scores


def permute_GSEA(enrichment, gene_set, expression, permutations=1000, seed=42):
    """Compute p-value for GSEA results using random permutation testing.
    
    Args:
        enrichment (float): enrichment of the gene set
        gene_set (list): list of gene names in a gene set
        expression (dict): dictionary with keys of gene set names and values of expression
        permutations (int): number of permutations
        seed (int): seed for random.sample
        
    Returns:
        p_val (float): empirical p-pvalue for enrichment
        

    """
    import numpy as np 
    import random

    observed_enrichment = GSEA(gene_set, expression)[0]
    enrichments = []
    
    random.seed(seed)
    gene_list = list(expression.keys())
    
    for _ in range(permutations):
        permuted_gene_set = random.sample(gene_list, len(gene_set))
        permuted_enrichment = GSEA(permuted_gene_set, expression)[0]
        enrichments.append(permuted_enrichment)
    
    enrichments = np.array(enrichments)
    p_val = np.sum(enrichments >= enrichment) / (permutations + 1)

    return p_val

⑶ Type 2. GO (gene ontology)

① Major bioinformatics initiative to integrate the expression of genes and gene products across all species

② Grouped according to CC (cellular component), MF (molecular function), BP (biological process), etc.

③ Evidence code: Genes are assigned to gene ontology terms using various sources of information. The source of information is stored in the evidence codes.

○ Type 1. Experimental evidence code: Experiment (EXP), Direct assay (IDA), Physical interaction (IPI), Mutant phenotype (IMP), Genetic interaction (IGI), Expression pattern (IEP)

○ Type 2. Computational evidence code: Sequence or structural similarity (ISS), Sequence orthology (ISO), Sequence (ISA), Sequence model (ISM), Genome context (IGC), Reviewed computational analysis (RCA)

○ Type 3. Author statement: Traceable author statement (TAS), Non-traceable author statement (NAS)

○ All but one type of evidence code is assigned by a curator; the one exception is inferred from Electronic Annotation (IEA).

③ Implementation: Functions implemented through R

library(EnhancedVolcano)
library(clusterProfiler)
library(org.Hs.eg.db)
library(org.Mm.eg.db)
library(enrichplot)

GO.plot <- function(gene){
    # ont = "ALL", "BP", "CC", "MF"
    # showCategory is not mandatory

    gene <- gsub("GRCh38", "", gene) # human 데이터 가공시의 reference 이름 제거
    gene <- gsub("mm10", "", gene) # mouse 데이터 가공시의 reference 이름 제거
    for(i in 1:10){
  	  gene <- gsub("-", "", gene) # 불필요한 앞부분의 - 제거
    }
    gene <- gsub('\\ .*$', '', gene) # 'KLK2 ENSG00000167751' 같은 것을 해결 
    
    if (gene[1] == toupper(gene[1])){ ## Human gene
        gene.df <- bitr(gene, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = org.Hs.eg.db)
        gene.df <- as.vector(gene.df[[2]])
        GO <- enrichGO(gene.df, OrgDb = 'org.Hs.eg.db',keyType = "ENTREZID", ont = "ALL", pvalueCutoff = 0.05, pAdjustMethod = "BH")
        dotplot(GO,split="ONTOLOGY", showCategory = 5)+facet_grid(ONTOLOGY~., scale="free")
	} else{ ## Mouse gene?
        gene.df <- bitr(gene, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = org.Mm.eg.db)
        gene.df <- as.vector(gene.df[[2]])
        GO <- enrichGO(gene.df, OrgDb = 'org.Mm.eg.db',keyType = "ENTREZID", ont = "ALL", pvalueCutoff = 0.05, pAdjustMethod = "BH")
        dotplot(GO,split="ONTOLOGY", showCategory = 5)+facet_grid(ONTOLOGY~., scale="free")
    }
}

### Example
GO.plot(c("COL1A1", "COL1A2", "COL3A1", "COL6A3"))

④ Example results

Figure 25. Example results of GO analysis

○ Count: Size of the intersection between input gene set and each GO term, i.e., the number of common genes

○ p.adjust: p value adjusted using Fisher’s exact test and Benjamini-Hochberg (B&H) adjustment between input gene set and each GO term

○ GeneRatio: Gene ratio, i.e., proportion of common genes in the input gene set (ref)

○ ▶ How to interpret a GO plot

⑤ Website

⑷ Type 3. DAVID (functional annotation bioinformatics microarray analysis)

① Provides biological interpretation of submitted genes: Still works relatively well, even though developed for microarray data

② Use with gene expression data or genomic regions restricted to < 1-5 kb from TSS.

③ How to use DAVID

⑸ Type 4. MSigDB (molecular signature database)

① Classification

○ H: Hallmark gene sets (50 terms)

○ C1: Positional gene sets (299 terms)

○ C2: Curated gene sets (6226 terms)

○ C3: Regulatory target gene sets (3556 terms)

○ C4: Computational gene sets (858 terms)

○ C5: Ontology gene sets (14765 terms)

○ C6: Oncogenic signature gene sets (189 terms)

○ C7: Immunologic signature gene sets (4872 terms)

○ C8: Cell type signature gene sets (302 terms)

② Difference between GO and MSigDB

○ GO: Many species

○ MSigDB: Human-centric. Some mouse (can convert to rat)

⑹ Type 5. EnrichR

① Performs gene set analysis using various gene set information.

② enrichr (R package; an online-based API) or web tool

③ Statistical significance calculation using Fisher’s exact test and Benjamini-Hochberg (B&H) adjustment.

④ Shows concordance between submitted genes and other annotated gene sets

⑺ Type 6. ToppGene, ToppFun, etc: Web-tool

① Transcriptome

② Proteome

③ Regulome (e.g., TFBS, miRNA)

④ Ontologies (e.g., GO, Pathway)

⑤ Phenotype (e.g., human disease, mouse phenotype)

⑥ Pharmacome (e.g., Drug-Gene associations)

⑦ Literature co-citation

⑻ Type 7. iLINCS: Drug database

⑼ Type 8. g:Profiler

⑽ Type 9. KEGG (kyoto encyclopedia of genes and genomes)

① Biochemical pathway database created in Japan (1995)

② One of the most cited databases in the field of biology

③ Includes various information related to systems, genes, health, chemistry, etc., centered around biochemical pathways. 577 pathways.

④ Method: pathview Bioconductor package, https://pathview.uncc.edu/, g:Profiler (Web-based tool)

⑤ KEGG website allows you to color genes to show differential expression.

⑾ Type 10. TRRUST

⑿ Type 11. Metabolic Pathways

⒀ Type 12. REAC (reactome), WP, TF, etc.

① Shows pathways related to submitted genes

② Web-based tools: g:Profiler, iLINCS

⒁ Type 13. IPA (ingenuity pathway analysis)

① Commercial software

② Visualizes pathways and networks through omics data

③ Allows identification of causal mechanisms and key factors using data

④ Includes information like canonical pathways, upstream regulator analysis, and updated information

⒂ Type 14. SPIA (signaling pathway impact analysis)

① Shows signaling pathway topology related to submitted genes

⒃ Type 15. MGI

⒄ Type 16. WP: Can be checked using web-based tools like g:Profiler

⒅ Type 17. TF: Can be checked using web-based tools like g:Profiler

⒆ Type 18. GSVA

⒇ Type 19. AUCell package

⒇ Type 20. HOMER motif analysis

⒇ Type 21. Gold standard method

⒇ Type 22. Ensembl gene

⒇ Type 23. UCSC knownGenes

⒇ Type 24. ssGSEA (single sample GSEA): ssGSEA generates enrichment scores for each sample. The ssGSEA results can then be used to look for outliers in your dataset or to compare enriched pathways among samples.

⒇ Type 25. singscore R package: Similar to ssGSEA. It generates enrichment scores for each sample. The results can then be used to look for outliers in your dataset or to compare enriched pathways among samples.

⒇ Type 26. LRpath: Web tool. Logistic regression

⒇ Type 27. Panther: Related to pathway.

⒇ Type 28. Biocarta: Related to pathway.

⒇ Type 29. MeSH term: Related to disease.

⒇ Type 30. DisGeNET: Related to disease.

⒇ Type 31. Cytoband

⒇ Type 32. Babelomics: Web-tool

⒇ Type 33. clusterProfiler: R package

⒇ Type 34. goseq: R package. Also used for GSE analysis for spatial transcriptomics.

⒇ Type 35. GOrilla: Non-parametric rank-based.

⒇ Type 36. RNA-Enrich: Logistic regression. LRpath website

⒇ Type 37. chip-enrich (example data, R code): GSE analysis for ChIP-seq. For datasets with sharp peaks.

⒇ Type 38. Broad-Enrich: Designed for histone data analysis. Also included in the chipenrich package. For broad genomic regions.

⒇ Type 39. GREAT (Genomic Regions Enrichment of Annotations Tool): Utilizes both binomial distribution and Fisher’s exact test, reducing false positives.

⒇ Type 40. iDEA: GSE analysis for scRNA-seq.

⒇ Type 41. Giotto: GSE analysis for spatial transcriptomics. Uses hypergeometric test, PAGE, or rank-based test for each spot.

⒇ Type 42. SPATA2: Executes the hypeR package, which uses the hypergeometric test. Can also be applied to spatial transcriptomics.

⒇ Type 43. GIGSEA: GSE analysis for GWAS

⒇ Type 44. MAGMA: GSE analysis for GWAS

⒇ Type 45. i-GSE4GWAS: GSE analysis for GWAS

⒇ Type 46. gometh function in the missmethyl package: GSE analysis for CpG sites

⒇ Type 47. Camera (Correlation Adjusted Mean Rank Analysis)

⒇ Type 48. Seq2pathway: R/Bioconductor package

⒇ Type 49. GREAT(genomic regions enrichment of annotations tool): Has very inflated type 1 error.

10. `Common 4.` Gene Interaction Analysis

⑴ Type 1. Cell-cell interaction(CCI, ligand-receptor interaction)

① Principle: Interaction between ligand and receptor in two cells when one cell has high ligand expression and the other has high receptor expression

② bulk-RNA-seq-based

○ BulkSignalR

○ squidy

○ IPA (upstream regulator analysis of ingenuity pathway analysis)

○ Omnipath: Refer here for the code that outputs all ligand-receptor pairs containing a specific gene.

○ BioCypher

③ scRNA-seq-based

○ CellTalkDB (human DB file, mouse DB file)

○ CellPhoneDB (tutorial)

○ CellChat (DB file; 359 unique interaction, 13284 possible combination)

○ ICELLNET

○ NicheNet

○ SoptSC

○ CytoTalk

○ scTensor

○ CCCExplorer

○ Connectome

○ Ramilowski

○ FlowSig: Graphical causal modeling (completed partial directed acyclic graph; CPDAG), Conditional independence test

○ scSeqComm: LR pairs from Reactome + TTRUST + RegNetwork

○ NATMI

○ SingleCellSignalR

④ ST-based

○ Giotto

○ spata2

○ CellPhoneDB v3

○ stLearn

○ SVCA

○ MISTy

○ NCEM

○ COMMOT: Uses optimal transport.

○ SCOTIA: In addition to the optimal transport theorem, it also considers physical distance for greater accuracy

○ STopover: Employs diffusion and Jaccard index

○ cytosignal

○ SpatialDM

○ SpaTalk

○ stMLnet

○ HoloNet

○ DeepLinc

○ scMultiSim

○ GCNG

○ conST

○ Tangram

○ DIALOGUE

○ CellTrek

○ LIANA+: An all-in-one framework incorporating CellphoneDB, CellChat, Connectome, scSeqComm, NATMI, Moran, PyDESeq2, Tensor-cell2cell (higher-order factorizations), scHOT, SpatialDM, etc.

⑤ Protein-protein interaction(PPI, molecular docking)

○ AlphaFold2 multimer, AFM-LIS, AlphaFold3

○ DeepDTA

○ DeepDTAF

○ DeepFusionDTA

○ GraphDTA

○ CAPLA

○ GNINA

○ SMINA

○ GLIDE

○ EquiBind

○ TANKBind

○ DIFFDOCK

⑵ Type 2. Network analysis: Gene-wide association study (GWAS), protein-protein interaction (PPI), etc.

① 2-1. Biological Network Construction

○ Type 1. Gene Regulatory Network

○ Type 2. Protein-Protein Interaction Network

○ Type 3. Co-expression Network

○ It is possible to construct networks based on biological interaction or gene expression data.

○ After constructing the network, it is possible to identify biologically important genes using various network measures/metrics.

② 2-2. Regulatory Network Analysis

○ Uses lasso-like equations such as min |y - βX| + λ|β|.

○ An algorithm that calculates edge agreement in protein-protein networks, gene expression networks, and TF-target gene networks.

○ Builds a network by updating regulatory network edges between TF and target genes.

○ Identifies group differences based on gene-gene interactions.

○ Gene expression regulation mechanism

○ Regulation of transcription initiation frequency

○ Regulation of transcription elongation

○ Alternative transcriptional initiation (ATI)

○ Alternative splicing (AS)

○ Alternative polyadenylation (APA)

○ RNA degradation by RNA Interference (RNAi)

○ Interference of RNAi by long noncoding RNA (lncRNA)

○ Translation initiation regulation

○ Chromosome remodeling and epigenetic regulation

③ 2-3. Sample-Specific Network

○ Proposes methods to construct biological networks for individual samples rather than using data from all samples or constructing group-based networks.

○ Through this, it is possible to discover differences in genes affected by specific genes in individual patients.

④ 2-4. Module / Community Detection

○ Assumes that networks consisting of biological components are not random but are composed of modules performing specific functions.

○ Assumes that nodes (e.g., genes) within a biological network form communities (e.g., pathways).

○ Can perform community/module detection, a network analysis technique.

○ Detected modules/communities can be biologically interpreted through GSEA analysis.

⑤ 2-5. Hub gene detection

○ Types of Networks: Degree Centrality, Betweenness Centrality, Closeness Centrality, Eigenvector Centrality, Participation Coefficient, Pagerank

○ Extracting Core Genes within Modules/Communities through Various Metrics for Finding Hubs in Network Analysis Techniques

⑥ Example 1. ToppNet: Identifies relative importance of candidate genes within a network.

⑦ Example 2. ToppGenet: Ranks neighboring genes within a protein-protein interaction network.

⑧ Example 3. GeneMANIA: Shows functionally similar genes using genomics and proteomics data.

⑨ Example 4. SCINET: Uses scRNA-seq algorithms.

⑩ Example 5. MEAGA(minimum distance-based enrichment analysis for genetic association): Employs the fact that genes from susceptibility loci in the disease-associated function/pathway are closer with each other in biological interactome.

⑪ Example 6. X2K: Displays upstream regulatory networks.

⑫ Example 7. WGCNA: TOM-based clustering.

⑬ Example 8. Louvain / Leiden algorithm: Modularity-based clustering

⑭ Example 9. IPA (Ingenuity Pathway Analysis): upstream regulator analysis of ingenuity pathway analysis

⑮ Example 10. CCCExplorer

⑯ Example 11. Connectome

⑰ Example 12. squidy

⑱ Example 13. spata2

⑲ Example 14. ALIGATOR: Analyzes whether genes associated with GWAS are excessively represented in specific pathways. Targets gene sets (GO, KEGG). Ensures uniqueness and applies multiple testing correction.

⑳ Example 15. INRICH: Analyzes whether gene sets are excessively clustered at specific locations. Uses permutation-based testing.

㉑ Example 16. DAPPLE: Analyzes whether genes are excessively connected within the PPI network. Evaluates connectivity in actual networks versus random networks.

㉒ Example 17. PiNET: Evaluates whether disease-associated genes are excessively connected within the protein-protein interaction (PPI) network. It annotates, maps, and analyzes the peptide moiety.

⑶ Type 3. TF (Transcription Factor) Analysis

① Overview

○ RNA polymerase in eukaryotic organisms cannot bind to the promoter alone.

○ When general transcription factors and specific transcription factors bind to various regulatory sequences upstream of the gene (e.g., TATA, CAAT), RNA polymerase can initiate transcription.

○ The number of transcription factors is approximately 1,600.

○ Pioneer factor: A special transcription factor that can bind even when chromatin is in a closed state. It is the first transcription factor to bind and is much less common.

○ CTCF transcription factor: Binds to DNA for a longer duration (~several minutes) compared to other TFs.

② Algorithm

○ FigR

○ scGRNom

○ scREMOTE

○ Triangulate

○ SCENIC

○ DeepSEM

○ Inferelator

○ Sc-compReg

○ BITFAM

○ Dorothea

11. `Common 5.` Cell Type Mapping Analysis

⑴ General cell type annotation

① Purpose

○ Significance 1. Cell type analysis in scRNA-seq and in-depth analysis of ROI in ST can effectively mitigate batch effects caused by sample selection bias.

○ Significance 2. Analyzing based on cell types rather than gene expression analysis makes the results easier to understand.

② Method 1. Clustering-based

○ Definition: Clustering scRNA-seq data and then labeling cell types based on the differential expression of genes within each cluster.

○ Drawback 1. Divided by resolution parameter: Not all cells in the same cluster necessarily belong to the same cell type.

○ Drawback 2. Cells of the same type might be split into different clusters depending on their states.

○ Drawback 3. Comparing cell type labels across different platforms is challenging (e.g., immune cell vs DC vs cDC)

③ Method 2. Meta-tagging

○ Definition: Attaching multiple cell types to a cell based on scores or multiple rounds of clustering.

○ Example: Labeling a single cell as immune cell, DC, and cDC simultaneously.

④ Method 3. Foundation Model Approach - Using language models to create a single labeler for specifying cell types.

○ GeneFormer: Based on BERT. Uses a transformer encoder-based architecture. Utilized with a pretraining → finetuning approach. Zero-shot capabilities are practically useless.

○ scGPT: Based on GPT. Uses a transformer decoder-based architecture. Utilized with a pretraining → finetuning approach. The zero-shot performance of the pretraining model is also quite excellent.

○ GenePT and Simple use of GPT-4

○ UCE (Universal Cell Embeddings)

○ scfoundation

○ CELLama

○ scBERT

○ GenePT

○ Concerto

○ scTrans

○ GeneCompass

○ scMulan

○ Nicheformer

○ scPRINT

○ Teddy

○ TranscriptFormer: Trained with 112 million cells. MIT license

⑵ Bulk RNA-seq

① xCell: Based on R. Defines weights for each human gene.

② immunedeconv: A benchmarking algorithm based on R.

○ For human data: Modes available include quantiseq, timer, cibersort, cibersort_abs, mcp_counter, xcell, epic, abis, consensus_tme, and estimate.

○ For mouse data: Modes available include mmcp_counter, seqimmucc, dcq, and base.

③ BayesPrism: Bayesian algorithm. Requires a scRNA-seq reference.

④ MixupVI

○ The original scVI architecture was used as is.

○ Mixup loss was added to the ELBO used in variational inference so that the latent space has linearity.

○ For benchmarking purposes, not only MixupVI but also OLS, NNLS, DWLS, Tape, Scaden, NuSVR (CIBERSORT), RLR, and WNNLS are provided.

⑶ scRNA-seq (Single-cell RNA sequencing)

① Seurat: Based on R. Employs a method called ‘Label Transfer.’

② scanpy: Python-based, especially related to cell type analysis through ingest.

③ Scanorama: Python-based.

④ sc-type: Based on R. Utilizes pre-defined gene sets for each cell type. Semi-automation

⑤ celltypist and celltypist2: Python-based. Logistic regression model. Saving pre-defined references in a .pkl file.

⑥ scTab

⑦ SELINA

⑧ Spoint

⑨ Tangram

⑩ TACCO

⑪ InsituType

⑫ Symphony

⑬ SingleR: Automated cell-type annotation tool

⑭ scPred: Automated cell-type annotation tool

⑮ CellAssign: Probabilistic model

⑯ Sargent: Scoring-based approach

⑰ CHETAH

⑱ scibet

⑲ scmap

⑳ CellID

㉑ sccatch

㉒ SCINA

㉓ scsorter

㉔ CellMarker 2.0

⑷ ST (Spatial Transcriptomics)

○ CellDART, spSeudoMap, RCTD (Based on R), MIA analysis: (Includes both enrichment and depletion analyses), Cell2location, SPOTlight, DSTG, CellTrek (After co-embedding scRNA-seq and ST, it performs cell type labeling using a distance-based graph and random forest), CytoSpace, Tangram, BayesPrism, DestVI, Stereoscope, SpatialDWLS, GIST, GraphST, Tacco, Smoother, CARD, Celloscope, Starfysh, Seurat, AdRoit, Spatial-ID (annotation transfrer from scRNA-seq), SpaTM-G, SpaOTsc, NLSDeconv, Redeconve, SpaCET, SpatialPrompt, SPADE, SpatialDecon, SONAR, AntiSplodge, UniCell Deconvolve, SpatialDDLS, TransformerST, DEEPsc, RedeHist, LETSmix, scResolve, SPACEL-Spoint, STEM, SpaDecon, Bulk2space, SpatialcoGCN, SD2, GNNDeconvolver, GTAD, STdGCN, DSTG, GTAD, STRIDE, SMART, stVAE, POLARIS, STIE, MUSTANG, BayesTME, RETROFIT, SpatialScope, ST-assign, MAST-Decon, EnDecon, Cottrazm, CellsFromSpace, FAST, CellPie, SpiceMix, LANTSA, NovoSparc, scDOT, DOT, SpaTrio, TACCO, LANSTA, SDDLS, BANKSY, MERINGUE, SpaGCN, BayesSpace, FICT, CMAP

12. `Advanced 1.` Alternative Splicing Analysis (AS Analysis)

⑴ Overview

① AS analysis can be conducted using existing data with splice-aware aligners.

② However, with the emergence of long-read sequencing, chosen as the technology of the year in 2022, more accurate analysis becomes possible.

⑵ Long-Read Sequencing

① Fewer sequencing gaps compared to short-read sequencing.

Figure 26. Long-Read Sequencing vs. Short-Read Sequencing

② Advantage 1. AS Analysis: Enables identification of alternative splicing events and isoforms.

③ Advantage 2. Facilitates the integration of epigenetics and transcriptomics.

④ Example 1. Pacific Biosciences SMRT (Single Molecule Real-Time) Sequencing: Average read length is ~20 kb.

⑤ Example 2. Oxford Nanopore Sequencing: Average read length is ~100 kb.

⑶ (Reference) Alternative Splicing Event

① SE (Skipped Exon): Entire specific exon is either included or excluded.

② A5SS (Alternative 5’ or 3’ Splice Site): Different usage of 5’ or 3’ splice junctions of an exon, not the whole exon.

③ MXE (Mutually Exclusive Exon): Exclusively spliced exon when another exon is spliced and vice versa.

④ RI (Retained Intron): Intron not coding for amino acids is retained or spliced.

⑷ (Reference) Exon Sequencing: Often contrasted with gene sequencing.

① Exon symbols examples:

○ chr15:63553600-63553679:-

○ chr15:56967876-56968046:-

○ chr7:7601136-7601288:+

○ chr11:220452-220552:-

② Gene symbol examples:

○ Human Genes: SIRPA, HBB-BS, etc.

○ Mouse Genes: Sirpa, Hbb-bs, etc.

⑸ Type 1. Event-Based AS Quantification: Quantification by dividing into exons using a count-based model.

① PSI Value: Quantifies PSI for each exon event.

○ Quantifies AS events using Percent-Splice-In (PSI) value.

○ PSI value = inclusion reads / (inclusion reads + exclusion reads)

○ Representative Tool: rMATs

② Exon Usage: Analyzed based on exon counts.

○ Reads can be classified into exon reads and junction reads based on mapped regions.

○ Exon reads: Mapped reads within exon regions.

○ Junction reads: Mapped reads at splice junctions.

○ Exon reads are counted per exon for exon usage (exon-level expression) calculation.

○ Representative Tool: DEXseq

⑹ Type 2. Isoform-Based AS Quantification

① Definition: Estimate expression of each isoform transcript at the transcript level using statistical models.

② Purpose: Improved efficacy in constraints, diagnostics, etc., if target isoforms can be defined.

③ Representative Tool: RSEM

④ Databases Used for Isoform Exploration

○ UniProt: Most well-known protein-related database.

○ Ensembl

○ Reactome

○ GPP Web Portal

○ NCBI Genome Data Viewer

○ NCBI assembly (example): Execute code targeting FASTA and GTF files (e.g., pyfaidx)

13. `Advanced 2.` Trajectory Analysis

⑴ Overview

① CNV (copy number variation): Abnormal number of chromosomes caused by cell division abnormalities. Refers to chromosome deletion or aneuploidy.

② SNP (single-nucleotide polymorphism): A difference in specific nucleotide sequences.

③ This is only possible with direct sequencing methods (e.g., Visium FF, scRNA-seq), not with probe-based methods (e.g., Visium FFPE).

④ In CNV analysis, p refers to the short arm, and q refers to the long arm.

⑵ CNV Analysis Algorithms

① CopywriteR: WGS-based. Analyzes off-target read depth.

② CNVkit: WGS-based. Analyzes deviations in read depth.

③ ASCAT: WGS-based

④ Ginkgo: scDNA-seq-based. Analyzes deviations in read depth.

⑤ InferCNV: scRNA-seq-based.

⑥ CopyKat: scRNA-seq-based.

⑦ Clonalscope: scRNA-seq-based.

⑧ CONICSmat: scRNA-seq-based.

⑨ HoneyBADGER: scRNA-seq-based.

⑩ CaSpER: scRNA-seq-based.

⑪ Numbat: scRNA-seq-based.

⑫ SpatialInferCNV: ST-based.

⑬ SPATA: ST-based.

⑭ STmut: ST-based.

⑮ STARCH: ST-based.

⑯ CalicoST: ST-based.

Table 8. Summary of CNV Analysis Algorithms

⑮ Problems with InferCNV and CopyKat in contrast to Numbat (ref)

○ Q: There seems to be a global baseline shift in the CNV profile produced by Numbat as compared to my other CNV callers; specifically, Numbat calls gains for the segments that appear to be neutral in other analyses, and neutral for segments that appear to be losses.

○ A: Many existing methods (e.g. InferCNV/CopyKAT) infer copy number variations relative to the median ploidy, which can dilute signals of aberrant regions or mistake neutral regions for aberrant due to baseline shifts caused by hyperdiploidy or hypodiploidy. Instead, Numbat first tries to identify diploid regions based on allele evidence (balanced allelic frequencies), and uses these regions as baseline for CNV calling.

⑶ SNP Analysis Algorithms

① Sniffle, Sniffle2: long-read DNA-seq-base

② SCmut: scRNA-seq-based.

③ scSNV: scRNA-seq-based.

④ SComatic: scRNA-seq-based.

⑤ STmut: ST-based.

⑷ Repetitive Sequence Analysis Algorithm

① Overview: In the genome reference, repetitive sequences are either N-masked or converted to lowercase.

② Experimental Methods

○ Southern Blotting

○ Sanger Sequencing

③ Tandem Repeat Identification Algorithms

○ RepeatMasker

○ Tandem Repeats Finder

○ HipSTR: Based on Illumina sequencing data

○ ExpansionHunter: Based on Illumina sequencing data

○ RepeatHMM: Long-read sequence-based STR detection

○ STRique: Long-read signal-based STR detection

○ DeepRepeat: Converts raw electrical signals from long-read sequencing data into images and applies CNN

⑸ Trajectory analysis pipeline

① gene expression along pseudotime: Monocle (ref1, ref2, ref3), TSCAN (ref), Slingshot (ref), PAGA, scEpath, stLearn, SpaceFlow, SIRV, PHATE (Potential of Heat-diffusion for Affinity-based Trajectory Embedding)

② cell abundance along pseudotime: milo (ref), DAseq (ref)

③ trajectory lineage: tradeSeq (ref), LinRace (ref), SCORPIUS (for scRNA-seq) (ref)

④ Trajectory analysis in samples from different batches: Phenopath (ref), Condiments (ref), Lamian (ref)

⑤ RNA velocity (spliced vs unspliced): Velocyte, scVelo, STARsolo, dynamo, MultiVelo (ref), VeloVAE, cellDancer, DeepVelo, VeloVI, STT (ST-based), SPATA (ST-based), TopoVelo (ST-based)

⑥ HMRF(hidden Markov random field): Startle (ST-based)

⑹ Phasing

① Definition: The process of mapping each RNA transcript to either the paternal or maternal chromosome, or at least to distinguishable haplotypes.

Table 9. The principle of phasing

② Primarily studied in F1 hybrid mouse models: RNA transcripts are mapped to each haplotype individually or to a combined reference based on SNPs.

③ Types of RNA-seq phasing: Lapels and Suspender pipeline, Eagle2, SHAPEIT, WhatsHap.

④ Types of Hi-C phasing: HARP, HaploHiC, ASHIC, HiCHap (Ohm).

⑺ ecDNA

① Detection via FISH: Corresponds to experimental methods.

② Detection via long-read sequencing: Decoil, CoRAL, AmpliconArchitect, AmpliconClassifier.

③ Detection via Hi-C: EagleC, NeoLoopFinder.

⑻ HMM (Hidden Markov Model)

① χ = {X_i} is a Markov process and Y_i = ϕ(X_i) (where ϕ is a deterministic function), then y = {Y_i} is a Hidden Markov Model.

② Baum-Welch Algorithm

○ Purpose: Learning HMM parameters

○ Input: Observed data (e.g., list of DNA sequences)

○ Output: Initial probabilities, state transition probabilities, and emission probabilities of HMM

○ Principle: A type of EM (Expectation Maximization) algorithm

○ Formula

○ B_k: Initial probability for state k

○ A_kl: Number of transition from state k to l

○ E_k(b): Number of emissions of observation b from state k

③ Viterbi Algorithm (ref)

○ Purpose: Find the most likely hidden state sequence given an HMM

○ Input: HMM parameters and observed data

○ N: Number of possible hidden states

○ T: Length of observed data

○ A: State transition probability, a_kl = probability of transitioning from state k to state l

○ E: Emission probability, e_k(x) = probability of observing x in state k

○ B: Initial state probability

○ Output: Most probable state sequence

○ Principle: Uses dynamic programming to compute the optimal path

○ Step 1. Initialization

○ b_k: Initial probability of state k, P(s0 = k)

○ e_k(σ): Probability of observing the first observation σ in state k, P(x₀ s₀ = k)

○ Step 2. Recursion

○ Compute maximum probability from the previous state at each time step i = 1, …, T

○ Compute backpointer (ptr) storing the most probable previous state

○ ptr_i(l) serves to store the previous state k that has the highest probability of transitioning to the current state l.

○ Step 3. Termination

○ Select the highest probability at the final time step

○ Determine the last state of the optimal sequence

○ v_k(i - 1): Optimal probability at previous time step i - 1 in state k

○ a_kl: Probability of transitioning from state k to l

○ Step 4. Traceback

○ Trace back through ptr array from i = T, …, 1 to recover the optimal path

○ Example

Figure 27. Example of Viterbi Algorithm

○ Python Code

class HMM(object):
    def __init__(self, alphabet, hidden_states, A=None, E=None, B=None):
        self._alphabet = set(alphabet)
        self._hidden_states = set(hidden_states)
        self._transitions = A
        self._emissions = E
        self._initial = B
        
    def _emit(self, cur_state, symbol):
        return self._emissions[cur_state][symbol]
    
    def _transition(self, cur_state, next_state):
        return self._transitions[cur_state][next_state]
    
    def _init(self, cur_state):
        return self._initial[cur_state]

    def _states(self):
        for k in self._hidden_states:
            yield k

    def draw(self, filename='hmm'):
        nodes = list(self._hidden_states) + ['β']

        def get_children(node):
            return self._initial.keys() if node == 'β' else self._transitions[node].keys()

        def get_edge_label(pred, succ):
            return (self._initial if pred == 'β' else self._transitions[pred])[succ]
        
        def get_node_shape(node):
            return 'circle' if node == 'β' else 'box'
            
        def get_node_label(node):
            if node == 'β':
                return 'β'
            else:
                return r'\n'.join([node, ''] + [
                    f"{e}: {p}" for e, p in self._emissions[node].items()
                ])

        graphviz(nodes, get_children, filename=filename,
                 get_edge_label=get_edge_label,
                 get_node_label=get_node_label,
                 get_node_shape=get_node_shape,
                 rankdir='LR')
        
    def viterbi(self, sequence):
        trellis = {} 
        traceback = [] 
        for state in self._states():
            trellis[state] = np.log10(self._init(state)) + np.log10(self._emit(state, sequence[0])) 
            
        for t in range(1, len(sequence)):
            trellis_next = {}
            traceback_next = {}

            for next_state in self._states():  
                k={}
                for cur_state in self._states():
                    k[cur_state] = trellis[cur_state] + np.log10(self._transition(cur_state, next_state)) 
                argmaxk = max(k, key=k.get)
                trellis_next[next_state] =  np.log10(self._emit(next_state, sequence[t])) + k[argmaxk] 
                traceback_next[next_state] = argmaxk
                
            trellis = trellis_next
            traceback.append(traceback_next)
            
        max_final_state = max(trellis, key=trellis.get)
        max_final_prob = trellis[max_final_state]
                
        result = [max_final_state]
        for t in reversed(range(len(sequence)-1)):
            result.append(traceback[t][max_final_state])
            max_final_state = traceback[t][max_final_state]
            
        return result[::-1]

④ Type 1. PSSM: Simpler HMM structure

⑤ Type 2. Profile HMM: It is advantageous over PSSMs regarding the following:

○ Diagram of profile HMM

Figure 28. Diagram of profile HMM

○ M, I, and D represent match, insertion, and deletion, respectively.

○ M_i can be transitioned to M_i+1, I_i, and D_i+1.

○ I_i can be transitioned to M_i+1, I_i, and D_i+1.

○ D_i can be transitioned to M_i+1, I_i, and D_i+1.

○ Advantage 1. The ability to model insertions and deletion

○ Advantage 2. Transitions are restricted only between valid state traversal.

○ Advantage 3. Boundaries between states are better defined.

○ Example 1. HMMER

○ hmmbuild: Creates a profile HMM from a set of aligned sequences.

○ hmmalign: Aligns a sequence to a profile HMM.

○ hmmsearch: Aligns a profile HMM against a sequence database.

○ hmmsca: Aligns sequences against a database of profile HMMs.

○ Example 2. HMMSTR

○ Example 3. SAM(sequence alignment and modeling)

○ Example 4. Pfam

○ A large collection of protein families, each represented by multiple sequence alignments and profile HMMs.

○ Pfam-A: A high quality, manually curated set of family profiles.

○ Pfam-B: A lower quality set of families identified through automated means (ADDA).

14. `Advanced 3.` Epigenomics Analysis

⑴ Type 1. Gene Function Identification

① Types of sequencing

○ Perturb-seq: Treating Cas9 expressing cells with different gRNA libraries, followed by simultaneous sequencing of gRNA and mRNA.

○ in vivo Perturb-seq

② DEG analysis algorithm for perturb-seq

○ CEDA

○ MAGeCK RRA

○ MAGeCK MLE

○ Gscreen

○ BAGEL2

○ Common

③ Single-cell perturbation prediction

○ scGen: Based on VAE

○ scPreGAN: Based on GAN

○ scVAEDer: Combination of VAE and DDN (a state-of-the-art diffusion model)

○ GeneFormer: Foundation model based on LLM

⑵ Type 2. Transcription Regulation Identification

① Types of sequencing

○ BS-seq (Bisulfite Sequencing): Identifying methylation patterns.

○ ChIP-seq (Chromatin Immunoprecipitation Sequencing): Identifying binding sites of transcription factors.

○ Hi-C (High Throughput Chromatin Conformation Capture Sequencing): 3D folding structure information of nuclear chromatin.

○ DNA Ticker Tape (Prime Editing)

○ ENGRAM (Enhancer-Driven Genomic Recording of Transcriptional Activity in Multiplex)

○ ATAC-seq

○ NOMe-seq (Nucleosome Occupancy and Methylome Sequencing)

○ MBD-seq

○ Ribo-seq

○ Bru-seq & BruChase-seq

② Peak caller (Peak finder)

○ Can be used for ChIP-seq, CUT&RUN, CUT&Tag, MeDIP-seq, MethylCap-seq, hmeDIP-seq, DNase-Seq, ATAC-seq.

○ Step 1. Read alignment and QC: ChIP-seq is usually single-end sequencing, so it’s simple. CUT&Tag analysis uses paired-end sequencing.

○ Step 2. Shift size estimation (read extension): Line up + and - strand reads for single-end data (for ChIP-seq). For example, when there are two peaks at a certain distance on the + and - strands, shifting by 1/2 fragment size merges them into a single peak.

○ Step 3. Determines window size: Most peak-finders use a sliding window.

○ Step 4. Peak detection: Identify peaks in the ChIP sample that are not in the control sample. Involves a statistical test.

○ Step 5. Deals with artifacts: Duplicates, strand difference, strand shift, etc.

○ Step 6. FDR estimation

○ Step 7. Deals with replicate samples, e.g. IDR.

○ Step 8. Downstream analysis, e.g. motif enrichment.

⑶ Type 3. Post-Translational Regulation Identification

① scRibo-seq

② STAMP-RBP

⑷ Type 4. Programmable Cell Function

① RADARS

② LADL (light-activated dynamic looping): photo-activatable gene expression

15. `Advanced 4.` Special Transcriptomics Analysis

⑴ scRNA-seq Analysis

① Normalization: SAMstrt (Katayama et al.), BASiCS (Vallejos et al.), GRM (Ding et al.), Simple Norm (Satija et al.), scran (Lun et al.), SCnorm (Bacher et al.), Linnorm (Yip et al.), RUV-III-NB, sctransform, Giotto, SpaNorm

② Data imputation: Reads imputation, scIGAN, MAGIC, VIPER, DeepImpute, SAUCIE

③ Batch effect removal: See above.

④ Cell cycle estimation: Cyclum, Cyclops, Oscape

⑤ Cell type prediction: SINCERA, SC3, ACTINN, scVI, CSCORE (cell-type-specific correlation)

⑥ CNV/subclone: See above.

⑦ Phylogeny tree: SCARLET, Monovar

⑧ Cell trajectory: See above.

⑨ Gene-gene interaction: PINNACLE, scNET, scLINE

○ PINNACLE: Combines scRNA-seq and protein-protein interaction (PPI) information

○ scNET: Combines scRNA-seq and protein-protein interaction (PPI) information

○ Goal: Learn more accurate expression patterns and gene interaction structures

○ Cell-cell similarity → Uses a KNN graph

○ Gene-gene relationship → Uses a PPI network

○ Inner product decoder → Restores gene-gene relationships

○ Fully connected decoder → Restores gene expression values

○ scLINE: Graph embedding method

⑩ Cell-cell communication: See above.

⑪ Gene network: SCINET, SCENIC

⑫ Gene imputation: scTransform, SAVER, MAGIC, DeepImpute

⑬ Single-cell perturbation prediction

○ scGen: Based on VAE

○ scPreGAN: Based on GAN

○ scVAEDer: Combination of VAE and DDN (a state-of-the-art diffusion model)

○ GeneFormer: Foundation model based on LLM

⑭ scRNA-seq/ST: DTSG

⑵ Spatial Transcriptomics Analysis (ref1, ref2)

① Barcode-based (spot-based) transcription factors: 10x Visium, etc.

② Image-based (FISH-based) transcription factors: 10x Xenium, Vizgen MERSCOPE, Nanostring CosMx, etc.

○ Patent dispute between Nanostring and 10x Genomics (‘23) (ref1, ref2) → Nanostring bankruptcy (ref) and acquisition (ref)

③ ST analysis downstream pipeline

○ Comprehensive Pipeline: Seurat, Squidpy, Scanpy, Giotto, SpatialData, ezSingleCell, SPACEc (Google Colab), Sopa (Google Colab)

○ Dimensionality reduction and clustering: Seurat, Scanpy, Giotto, SPATA, STUtility

○ Identifying spatial domains: SPACEL, STAGATE, GraphST, stLearn, RESEPT, Spatial-MGCN, SpaGCN, ECNN, SEDR, JSTA, STGNNks, conST, CCST, BayesSpace, SpatialPCA, DRSC, Giotto-H, Giotto-HM, Giotto-KM, Giotto-LD, Seurat-LV, Seurat-LVM, Seurat-SLM, IRIS, NeST, SC-MEB, NovoSpaRc, UTAG, BANKSY, NSF, BASS, SpaTopic, SpatialLDA, CeLEry, SpaGene, DeepST, CellTrek, MULTILAYER, RESEPT, Space-GM, SOTIP, SpaTM-S, scNiche, Novae, MuCoST, STAIG, SPADE, HMRF, SpiceMix, MENDER, SpaceFlow, Cellular Neighborhood Clustering (CNC)

Algorithm	Spatial Information Required	Histology Information Required	Cluster Number Required	Program Language
BayesSpace	Yes	No	Yes	R
DRSC	optional	No	optional	R
Giotto-H	No	No	Yes	R
Giotto-HM	Yes	No	Yes	R
Giotto-KM	No	No	Yes	R
Giotto-LD	No	No	No	R
Seurat-LV	No	No	No	R
Seurat-LVM	No	No	No	R
Seurat-SLM	No	No	No	R
SpaCell	No	Yes	Yes	Python
SpaCell-G	No	No	Yes	Python
SpaCell-I	No	Yes	Yes	Python
SpaGCN	Yes	No	optional	Python
SpaGCN+	Yes	Yes	optional	Python
stLearn	Yes	Yes	No	Python

Table 10. Spatial Clustering Methods

○ BayesSpace

○ Employs a Bayesian approach.

○ Higher weights to spatial locations that are physically closer.

○ Its performance might be limited by its fixed smoothing parameter of the Markov random fields.

○ Not computationally scalable for high-throughput spatial transcriptomics data.

○ Includes a batch effect correction.

○ SC-MEB

○ Performs spatial clustering through an “empirical Bayes approach” capable of optimizing the smoothness parameter.

○ Estimates its parameters using an iterative-conditional-mode-based expectation-maximization method to boost its computational efficiency and scalability to high throughput data.

○ novoSpaRc

○ Aim is to minimize discrepancy between shortest paths in expression and spatial data.

○ If 2 cells are close expression-wise, they are likely close spatially.

○ It calculates shortest path lengths from a KNN graph based on correlation-based distance from scRNA-seq data.

○ NeST

○ Step 1. Compute single-gene expression hotspots across full transcriptome

○ Step 2. Construct hotspot similarity network

○ Step 3. Communities in network are averaged into coexpression hotspots representing a shared expression pattern

○ SEDR

○ Uses GNN.

○ STAGATE

○ Uses GNN.

○ Space-GM

○ Uses a GNN to perform supervised learning.

○ UTAG

○ Cell annotation without GNN

○ Includes a batch effect correction.

○ SpatialPCA

○ Uses a kernel matrix in probabilistic PCA to model spatial correlations.

○ SOTIP

○ Clusters cells based on a network of cell neighborhoods.

○ CeLEry

○ Histology- and cytoarchitecture-based manual annotation.

○ Identifying spatially variable genes (SVGs): Moran’s I (R-based Seurat, Python-based Squidpy), Geary’s C (Python-based Squidpy), Landscape Shape Index, SPARK (code), SPARK-X (R-based), SpatialDE (Python-based) (code), SpaGCN, ST-Net, STAGATE, HisToGene, CoSTA, CNNTL, SPADE, DeepSpaCE, conST, Spatial-MGCN, STGNNks, SpatialScope, nnSVG (R-based), Hotspot, MERINGUE (R-based), Trendsceek (code), HMRF (code), Celina, Crescendo, Giotto k-means (R-based), Giotto rank (R-based), SomDE (Python-based), Seurat, Scanpy, GPcounts, SpaTM-S, SpatialCorr, Sinfonia, mark variogram (R-based Seurat), SpaNormSVG, RayleighSelection, MULTILAYER, HRG, scGCO, SpaGene, C-SIDE, CTSV, BSP, DESpace, HEARTSVG, BinSpect, Sepal, BOOST-GP, BOOST-MI, SMASH, SVGbit, PROST, BOOST-HMI, spVC, SpaGFT, SingleCellHayStack, Belayer

Figure 29. A hierarchical summary of SVG detection methods

○ SpatialDE

○ Gaussian process regression

○ To decompose the expression variation into a spatial component and a non-spatial component.

○ Spatial component modeled by spatial covariance based on the pairwise spatial distances among locations.

○ Non-spatial component formulated as noise.

○ Lower computational complexity than Trendsceek. Computational complexity still scales cubically with respect to the number of spatial locations.

○ SPARK

○ Generalized Poisson regression

○ Generative model with a variety of kernels: Computes p-values using each of the kernels and utilizes the Cauchy combination rule to combine the p-values.

○ To detect genes with spatial variation.

○ Lower computational complexity than Trendsceek. Computational complexity still scales cubically with respect to the number of spatial locations.

○ SPARK-X

○ Non-parametric method

○ A covariance matrix for the gene expression and a covariance matrix for the spatial coordinates are utilized.

○ If the gene expressions are independent of the spatial coordinates, the product of the two covariance matrices will be small.

○ Lower computational complexity than Trendsceek, SpatialDE, and SPARK.

○ MERINGUE: spatial autocorrelation measure

○ nnSVG: nearest-neighbor Gaussian process

○ Trendsceek

○ Marked point process theory

○ Gene expression (mark)

○ Spatial location (point)

○ Tests the dependency of distributions of marks and points.

○ High computational complexity

○ SpaGCN

○ GCN approach

○ Integrates gene expression data, spatial location information, and histology images.

○ Aggregating feature information from each node’s neighbors improves the identification of localized gene expression patterns and, consequently, spatially variable genes.

○ SpaGCN is also computationally fast and memory efficient compared to SPARK and SpatialDE.

○ SVCA

○ SVCA uses Gaussian processes with additive covariance to model variation in gene expression.

○ Decomposes into intrinsic + environmental + interaction effects: interactions are modeled by integrating gene expression and spatial differences.

○ Celina

○ Multiple kernel strategy

○ Enhancing gene expression resolution (GER): BayesSpace, XFuse, DeepSpaCE, HisToGene, SuperST, TESLA, iStar, Thor (anti-shrinking Markov graph diffusion method), TransformerST

○ iSTAR

○ Used to enhance the resolution of spatial transcriptomics. Uses a BEiT-based model trained with the DINO method.

Figure 30. iSTAR Data Preparation Diagram

○ Step 1. Partition the given image into 256 × 256 patches.

○ Step 2. Subdivide each patch into 16 × 16 sub-patches.

○ Step 3. Apply ViT (denoted as f₂) to each sub-patch to obtain a 384-dimensional vector.

○ Step 4. Collect the 384-dimensional vectors to create a 16 × 16 × 384 data structure, then apply another ViT (denoted as f₁) to obtain a 192-dimensional vector.

○ Step 5. Collect the 192-dimensional vectors and apply ViT (denoted as f₀).

○ Feature and loss function formulation

○ Gene imputation: LIGER, SpaGE, stPlus, Seurat, Tangram, gimvI, GeneDART, NovoSpaRc, spARC, BLEEP, SpaOTsc

○ Gene-gene interaction: scHOT, GCNG, MISTy, MESSI, SEAGAL (Google Colab), scMultiSim

○ Cell-cell interaction (CCI): Giotto, MISTy, stLearn, GCNG, conST, COMMOT, NCEM, Tangram, DIALOGUE, CellTrek, SVCA, spata2, CellPhoneDB v3, SCOTIA, STopover, cytosignal, SpatialDM, SpaTalk, stMLnet, HoloNet, DeepLinc, scMultiSim

○ Cell type deconvolution/identification: MIA, Stereoscope, RCTD, Cell2location, DestVI, STdeconvolve, SPOTlight, SpatialDWLS, GIST, GraphST, DSTG, Tangram, CellDART, spSeudoMap, Tacco, Smoother, CARD, Celloscope, Starfysh, Seurat, AdRoit, Spatial-ID (annotation transfrer from scRNA-seq), SpaTM-G, SpaOTsc, NLSDeconv, Redeconve, SpaCET, CellTrek, SpatialPrompt, SPADE, SpatialDecon, SONAR, AntiSplodge, UniCell Deconvolve, SpatialDDLS, TransformerST, DEEPsc, RedeHist, LETSmix, scResolve, SPACEL-Spoint, STEM, SpaDecon, Bulk2space, SpatialcoGCN, SD2, GNNDeconvolver, GTAD, STdGCN, DSTG, GTAD, STRIDE, SMART, stVAE, POLARIS, STIE, MUSTANG, BayesTME, RETROFIT, SpatialScope, ST-assign, MAST-Decon, EnDecon, Cottrazm, CellsFromSpace, FAST, CellPie, SpiceMix, LANTSA, NovoSparc, scDOT, DOT, SpaTrio, CytoSPACE, TACCO, LANSTA, BayesPrism, BASS, SDDLS, BANKSY, MERINGUE, SpaGCN, BayesSpace, FICT, CMAP

Figure 31. Types of cell type deconvolution algorithms

○ Cell segmentation: Watershed, Cellpose, JSTA, Baysor, GeneSegNet, SSAM, ClusterMap, SCS, QuST, Proseg, Bioturing segmentation (github), InstanSeg, StarDist, CellViT (github, application), HistAI (github), BIDCell, FICTURE, Comseg, Points2Regions, Sainsc, UCS, HoVer-Net, Triple U-Net, CDNet, Omnipose, CPP-Net, PathoSAM, NuLite, ilastik, MESMER

○ Cellpose: Trains an artificial neural network to determine whether the gradients of pixels are directed toward the cell interior. Generates nuclear and cell segmentation results in .tif and .npy formats.

○ QuST: An extension of QuPath. Generates nuclear and cell segmentation results in GeoJSON format.

○ Baysor: Bayesian mixture model is used. Considers transcript location·composition, cell size·shape to determine cell boundaries.

○ Proseg: Considers transcript location·composition, cell size·shape to determine cell boundaries. Cellular Potts model, MCMC, and transcript repositioning are used.

○ BIDCell: Self-supervised deep learning

○ Points2Regions: Unsupervised clustering, K-means clustering

○ Segmentation-free method: FICTURE, Points2Regions, ComSeg

○ Spatial niche: NicheNet, Nicheformer, CellCharter (Google Colab), GraphSAGE, SIMVI

○ Cellcharter: Employes scVI embedding and GMM clustering.

○ Image alignment (Image registration): DiPY, bUnwarpJ (ImageJ), STalign, SpatialSPM

○ Image-to-gene-expression: NSL, DeepSpaCE, ST-Net, HisToGene, Hist2ST, THItoGene, TCGN

○ 3D reconstruction: PASTE, SPACEL, STAGATE, STUtility (only uses histology images, not spatial gene expression), Splotch (for older ST platform; requires manual annotations), GPSA, XVFI, InterpolAI

○ PASTE

○ An algorithm that merges and aligns multiple ST slices derived from the same tissue by considering gene expression similarity and spatial distance.

○ Uses NMF to extract latent features of gene expression and incorporates both latent features and spatial distances through the optimal transport theorem.

○ Enables more accurate clustering across multiple slices.

○ Modes: pairwise slice alignment, center slice integration.

○ Does not use tissue images.

○ The gene expression values for each spot use the existing raw counts or normalized counts, but the position of that spot is adjusted in 3D space.

○ STUtility

○ Uses only tissue images without considering gene expression or spatial coordinates.

○ Splotch

○ Developed for older ST platform data.

○ Requires manual annotation.

○ XVFI: 3D biomedical image reconstruction

○ InterpolAI

○ A model for restoring missing or damaged images in 3D biomedical image stacks (e.g., H&E, MRI, or fluorescence images).

○ A modified version of the FILM (Frame Interpolation for Large Motion) model, originally used for video frame interpolation.

○ At its simplest, one might consider linear interpolation, but this model is significantly more accurate.

○ Capable not only of interpolation between 2D images, but also of 3D rendering of the interpolated 2D images.

○ CNV inference: SpatialInferCNV, SPATA2 (demo), STmut, STARCH, CalicoST

○ Trajectory analysis: stLearn, SpaceFlow, SPATA2, Startle, Spateo, MOSCOT, SLAT, Hotspot, cell2fate, GASTON, STT, TopoVelo

○ spatial data simulation : scDesign3

Figure 30. ST analysis downstream pipeline

④ 3D spatial transcriptomics (ref1, ref2, ref3, ref4, ref5)

⑤ subcellular ST: nearest-neighbor, InSTAnT, TopACT (cell type classification), APEX-seq-based atlas, Bento, TEMPOmap, subcellular mRNA kinetic modeling, Rustem et al., Pengfei et al.

⑶ Temporal Transcription Factor Analysis

① Type 1. short time series data

○ STEM (short time-series expression miner)

○ TimesVector

② Type 2. Long time series data: similar to analysis of general experimental groups

③ Type 3. Temporal sequencing

○ Record-seq

○ Live-seq

○ TMI

○ molecular recording

⑷ Spatiotemporal Omics

① ORBIT (single-molecule DNA origami rotation measurement)

② 4D spatiotemporal MRI or hyperpolarized MR

③ in vivo 4D omics with transparent mice

⑸ Spatial Tri-omics

① DBiT ARP-seq (ATAC + RNA + Protein)

② DBiT CTRP-seq (CUT&Tag (H3K27me3) + RNA + Protein)

16. `Advanced 5.` Database Utilization

⑴ Bioinformatics Resources

① Examples: PubMed, NCBI, bioRxiv, BioStars, Bioinformatics Stack Exchange, Stack Overflow

② Data Repositories

○ ArrayExpress (EBI)

○ Gene Expression Omnibus (GEO): fastq-dump is no longer supported.

○ GenomeRNAi, dbGAP

○ The European Genome-phenome Archive (EGA)

○ Database of Interacting Proteins (DIP)

○ IntAct

○ Japanese Genotype-phenotype Archive (JGA)

○ NCBI PubChem BioAssay

○ Genomic Expression Archive (GEA)

○ GWAS Catalog

○ UCSC Genome Browser

③ Webpage Crawling

⑵ Small Molecule Database

① Integrated Small Molecule Database: Database providing data on the physiological activity of about 800,000 small molecules in vector format

② AlphaFold2 Database: Database with structural data of 200 million proteins

③ ensembl: Transcriptome database

④ uniprot: Protein database

⑤ The Human Protein Atlas: Public access resource aiming to map all human proteins in cells, tissues, and organs

⑥ SGC (Chemical Probes): Provides a unique probe collection along with related data, control compounds, and usage recommendations

⑶ Spatial Transcriptomics database

① HCA

② HuBMAP

③ SODB

④ STOmicsDB

⑤ SpatialDB

⑥ SOAR

⑦ HTAN

⑧ Allen Brain Map

⑷ Drug Genomics Database

① NCBI dbSNP

② gnomAD

③ pharmVar

④ PHARMGKB

⑤ NCBI PubChem

⑥ Broad Institute CMAP

⑦ CTD

⑧ Comptox

⑨ Drug Bank

⑩ Stitch (search tool for interactions of chemicals)

⑪ ToppFun

⑫ depmap

⑬ L1000CDS2

⑭ L1000FWD

⑮ GDSC (Genomic of Drug Sensitivity in Cancer)

⑯ CCLE

⑰ ClinicalTrials.gov: Provides information on clinical trial progress for each drug

⑸ Clinical and Non-clinical Databases

① TCGA (The Cancer Genome Atlas)

○ Diverse human tumor profiling data based on DNA, RNA, protein expression, and epigenetic factors

○ How to obtain TCGA data

② GWAS Catalog

○ Provides curated information and educational resources on genome-related information to identify causal variants and understand disease mechanisms for new therapies

③ MGI (Mouse Genome Informatics)

○ Database collecting mouse mutation, phenotype, and disease data. Each gene ontology (GO) is well organized

④ Open Targets Platform

○ Integrated platform for phenotype data such as expression, co-localization, and prioritization signature related to specific targets associated with certain diseases

⑤ UK Biobank

○ Metabolome: Sample of 120,000. Measured from blood collected between 2006 and 2010

○ Blood Biomarkers: Sample of 500,000. Measured from blood collected between 2006 and 2010

○ Genome (GWAS, WES, WGS): Sample of 500,000. Measured from blood collected between 2006 and 2010

○ Summary-level Clinical Records: Sample of 500,000. Hospital diagnoses (ICD codes) and date of first diagnosis

○ Record-level Clinical Records: Sample of 250,000. Date-specific diagnosis/prescription records. The start year for tracking is as follows

○ 1997 for England

○ 1998 for Wales

○ 1981 for Scotland

⑥ gnomAD

○ A large-scale public database aggregating human genome variation data: Primarily used as a reference for genetic variation and rare disease research.

○ Method 1. Google Cloud: Use the command $ gsutil ls gs://gcp-public-data--gnomad/release/. Also available as a BigQuery dataset.

○ Method 2. AWS: Use the command $ aws s3 ls s3://gnomad-public-us-east-1/release/. Utilizes the AWS Command Line Interface.

○ Method 3. Azure: Use the command $ azcopy ls https://datasetgnomad.blob.core.windows.net/dataset/. Can be accessed via AzCopy or Azure Storage Explorer.

○ Method 4. Hail

○ Method 5. Terra

⑦ Other Clinical Databases

Country	Institution	Clinical Data	Genomic Data	Transcriptomic Data	Proteomic Data	Imaging Data
USA	PRIMED	O	O
USA	All of US	O	O
USA	National Institute of Diabetes and Digestive and Kidney Diseases	O	O
USA	Slim Initiative in Genomic Medicine for the Americas	O	O
UK	UK Biobank	O	O	O	O	O
China	Kadoorie Biobank	O	O		O	O

Table 11. Other Clinical Databases

⑹ Policy Databases

① EQIPD Quality System: New non-clinical research quality system developed by the European Quality in Preclinical Data (EQIPD) consortium composed of 8 countries and 29 institutions, applicable to both public and private sectors

② FAIRsharing: Curated information and educational resources on database and data policy-related data and metadata standards

⑺ Networking Platforms

① Chemicalprobes.org: Portal site to receive expert advice on finding and using chemical probes in pharmaceutical research and drug development

② European Lead Factory: Collaborative public-private partnership site for innovative drug development

③ Genotype-Tissue Expression project: Public resource project for tissue-specific gene expression and regulation

④ GOT-IT Expert Platform : Platform facilitating exchange between academic researchers and industry experts to promote new academia-industry collaborations

⑤ SPARK Global Initiative: International network focusing on exchanging specialized knowledge and addressing immediate unmet medical needs, enhancing and developing projects

Input: 2021.10.02 13:49

Modification: 2023.07.11 11:19

2050

Transcriptomics Analysis Pipeline

1. `QC 1.` Experimental QC (Sample Level Quality Control)

2. `QC 2.` Data QC (Sequence Level Quality Control)

3. `QC 3.` Filtering

4. `QC 4.` Alignment

5. `QC 5.` Normalization

6. `QC 6.` Batch Effect

7. `Common 1.` Clustering

8. `Common 2.` Differentially Expressed Gene (DEG) Analysis

9. `Common 3.` Gene Set Enrichment

10. `Common 4.` Gene Interaction Analysis

11. `Common 5.` Cell Type Mapping Analysis

12. `Advanced 1.` Alternative Splicing Analysis (AS Analysis)

13. `Advanced 2.` Trajectory Analysis

14. `Advanced 3.` Epigenomics Analysis

15. `Advanced 4.` Special Transcriptomics Analysis

16. `Advanced 5.` Database Utilization

results matching ""

No results matching ""

Transcriptomics Analysis Pipeline

1. QC 1. Experimental QC (Sample Level Quality Control)

2. QC 2. Data QC (Sequence Level Quality Control)

3. QC 3. Filtering

4. QC 4. Alignment

5. QC 5. Normalization

6. QC 6. Batch Effect

7. Common 1. Clustering

8. Common 2. Differentially Expressed Gene (DEG) Analysis

9. Common 3. Gene Set Enrichment

10. Common 4. Gene Interaction Analysis

11. Common 5. Cell Type Mapping Analysis

12. Advanced 1. Alternative Splicing Analysis (AS Analysis)

13. Advanced 2. Trajectory Analysis

14. Advanced 3. Epigenomics Analysis

15. Advanced 4. Special Transcriptomics Analysis

16. Advanced 5. Database Utilization

results matching ""

No results matching ""

1. `QC 1.` Experimental QC (Sample Level Quality Control)

2. `QC 2.` Data QC (Sequence Level Quality Control)

3. `QC 3.` Filtering

4. `QC 4.` Alignment

5. `QC 5.` Normalization

6. `QC 6.` Batch Effect

7. `Common 1.` Clustering

8. `Common 2.` Differentially Expressed Gene (DEG) Analysis

9. `Common 3.` Gene Set Enrichment

10. `Common 4.` Gene Interaction Analysis

11. `Common 5.` Cell Type Mapping Analysis

12. `Advanced 1.` Alternative Splicing Analysis (AS Analysis)

13. `Advanced 2.` Trajectory Analysis

14. `Advanced 3.` Epigenomics Analysis

15. `Advanced 4.` Special Transcriptomics Analysis

16. `Advanced 5.` Database Utilization