RNA Sequencing Quality Control (QC)

Recommended Article: 【Bioinformatics】 Table of Contents for Bioinformatics Analysis

1. Experimental QC

2. Data QC

3. Troubleshooting

a. Genome Projects and Sequencing Techniques

b. Transcriptome Analysis Pipeline

1. Experimental QC (sample level quality control)

⑴ Definition: Measure of tissue quality

⑵ RNA Sequencing Process

① Step 1. RNA Purification: Treat with DNase to remove DNA.

② Step 2. Poly(A) Selection: Enrich polyadenylated RNA.

③ Step 3. Fragmentation: Shear RNA into library-insert sizes of 200–400 nt.

④ Step 4. cDNA Synthesis: Convert RNA into complementary DNA (cDNA).

⑤ Step 5. cDNA Processing: Ligate adaptors, amplify, and add barcodes.

⑥ Step 6. Sequencing: Sequence one or both ends of the fragments, typically 50, 100, or 150 nt per read.

⑦ Step 7. Read Mapping: Align sequencing reads to the genome.

⑶ Type 1. RIN (RNA Integrity Number)

① Background Knowledge: mRNA comprises less than 3% of total RNA. rRNA makes up more than 80% (mainly 28S [5 kb] and 18S [2 kb] in eukaryotic cells).

② Measured by the Agilent 2100 Bioanalyzer.

③ RIN Algorithm: It uses features such as the ratio of 18S to total RNA, the ratio of 28S to total RNA, and the 18S normalized height.

④ RIN = 10: Intact RNA.

⑤ RIN = 1: Completely degraded RNA.

⑥ RIN > 7: Generally considered suitable quality level for RNA-seq.

⑷ Type 2. DV200: The percentage of RNA fragments longer than 200 nt. Used for FFPE tissues, where RNA is heavily fragmented.
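As a rough illustration of DV200, the sketch below computes the percentage of signal coming from fragments longer than 200 nt. The function name `dv200` and the binned size/signal values are hypothetical; a real electropherogram export would have many more bins and instrument-specific preprocessing.

```python
def dv200(sizes_nt, signal):
    """Return the percentage of total signal from fragments longer than 200 nt."""
    if len(sizes_nt) != len(signal):
        raise ValueError("sizes_nt and signal must have the same length")
    total = sum(signal)
    if total <= 0:
        raise ValueError("total signal must be positive")
    above = sum(s for size, s in zip(sizes_nt, signal) if size > 200)
    return 100.0 * above / total


# Hypothetical binned trace: fragment size (nt) and relative signal per bin
sizes = [50, 100, 150, 250, 500, 1000]
signal = [10, 20, 20, 30, 15, 5]
print(f"DV200 = {dv200(sizes, signal):.1f}%")
```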

⑸ Type 3. Nucleic Acid Purity Quantification via Absorbance Ratio

① 260 nm / 280 nm Ratio

○ Pure DNA: ~1.8

○ Pure RNA: ~2.0

○ A low 260 nm / 280 nm ratio suggests the presence of proteins or phenol, which absorb at 280 nm.

② 260 nm / 230 nm Ratio

○ Pure DNA/RNA molecules: ~2-2.2

○ A low 260 nm / 230 nm ratio indicates the presence of other contaminants.
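The two absorbance ratios can be checked with a few lines of code. This is a minimal sketch: the function name `purity_flags` and the default cutoffs (1.8 and 2.0) are assumptions chosen for illustration from the approximate values above, not fixed standards.

```python
def purity_flags(a230, a260, a280, min_260_280=1.8, min_260_230=2.0):
    """Compute the A260/A280 and A260/A230 ratios and flag low values.

    The cutoffs are illustrative assumptions; adjust them per sample type
    (for example, about 2.0 for the 260/280 ratio of pure RNA).
    """
    r_260_280 = a260 / a280
    r_260_230 = a260 / a230
    flags = []
    if r_260_280 < min_260_280:
        flags.append("low 260/280: possible protein or phenol")
    if r_260_230 < min_260_230:
        flags.append("low 260/230: possible other contaminants")
    return r_260_280, r_260_230, flags


# Hypothetical absorbance readings
print(purity_flags(a230=0.5, a260=1.0, a280=0.5))
print(purity_flags(a230=1.0, a260=1.0, a280=0.8))
```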

⑹ Type 4. Nucleic Acid Weight Quantification

① Extract RNA using miRNeasy Mini Kit (QIAGEN) or similar methods.

② RNA weight criterion: At least 250 ng.

⑺ Type 5. RNA Quality Score (RQS)

⑻ Type 6. ChIP-seq Experimental QC

① ChIP grade monoclonal antibody: Pre-tested with ChIP-seq. 20-30% of the commercially produced antibodies tested were unsatisfactory for ChIP-seq.

② qPCR: It is best to test positive control region(s) using the ChIP sample. Ideally, >10–12-fold enrichment over IgG (non-specific antibody) is detected.

③ Biotinylated transcription factor: Permits factor pull-down on streptavidin. Independent of antibodies.

④ CUT&RUN, CUT&Tag, and ChIP-exo are methods to improve the resolution and signal-to-noise ratio of peaks in ChIP-seq.

⑼ Type 7. ATAC-seq Experimental QC

① Tn5 Concentration: Higher Tn5 concentration relative to DNA concentration increases ATAC-seq signal intensity at promoters and enhancers while reducing fragment size.

② Sequencing Lane Cluster Density: Shifts fragment length distribution and TSS enrichment.

⑽ Type 8. Spatial Transcriptomics Experimental QC (ref)

Table 1. Spatial Transcriptomics Experimental QC

① Fresh-frozen tissue has higher RNA integrity than FFPE, but poorer tissue morphology quality.

② For FFPE samples, DV200 is a more appropriate metric.

③ Image-based spatial transcriptomics platforms like Xenium and CosMx can better tolerate RNA degradation compared to spot-based ST methods.

2. Data QC (sequence level quality control)

⑴ Definition: Evaluation of data quality and improvement as necessary

⑵ Type 1. Data QC Metrics: Used for external validity confirmation by comparing with manuals or other datasets

① QC Metrics

○ In dUTP method, “_1.fastq” represents the first strand (anti-sense), and “_2.fastq” represents the second strand (sense; original RNA sequence)

○ base quality

○ mapping rate

○ mappability filter

○ Type 1. Uniqueness: How unique each sequence is starting at a particular base and of a particular length

○ Type 2. Alignability: How uniquely k-mer sequences align to a region of the genome (up to 2 mismatches allowed)

○ mappability score : S = 1 / # of matches found in genome

○ Long reads can resolve mapping issues among highly similar regions. However, some regions of the genome remain troublesome regardless of read length.
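The mappability score S = 1 / (number of matches) can be illustrated on a toy sequence. The sketch below is a simplification under stated assumptions: it counts exact k-mer matches on the forward strand only (no mismatches, no reverse complement), and the name `uniqueness_mappability` is hypothetical.

```python
from collections import Counter


def uniqueness_mappability(genome, k):
    """Return S = 1 / (number of exact occurrences) for the k-mer at each start position."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [1.0 / counts[kmer] for kmer in kmers]


# Toy genome: the 4-mer "ACGT" occurs twice, every other 4-mer once
genome = "ACGTACGTTT"
print(uniqueness_mappability(genome, 4))
```

A score of 1 means the k-mer starting at that position is unique in the sequence; lower scores mark repeated k-mers.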

○ Non-coding RNA ratio

○ High non-coding RNA ratio indicates lower RNA quality.

○ GC content

○ High GC content: Indicates potential rRNA contamination. In this case, filter out 5S, 18S, 28S rRNA

○ Low GC content: Indicates potential issues with reverse transcription

○ Related to CpG island.
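GC content per read is simple to compute directly. The sketch below is illustrative only: the function name `gc_fraction` and the example reads are hypothetical, and ambiguous bases such as N are excluded from the denominator as a design choice.

```python
def gc_fraction(seq):
    """Return the fraction of G/C among called bases (A, C, G, T); N and others are ignored."""
    called = [base for base in seq.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return sum(base in "GC" for base in called) / len(called)


# Hypothetical reads
reads = ["ACGTACGT", "GGGCCCGC", "ATATATNN"]
per_read = [gc_fraction(r) for r in reads]
print(per_read)
print(sum(per_read) / len(per_read))  # mean GC fraction across reads
```

Comparing the per-sample distribution of these values against the expected GC content of the transcriptome is what tools such as FastQC report graphically.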

○ Read duplicate

○ PCR duplicate: A duplicate that is merely a replication of the same nucleic acid molecule during PCR. High duplication indicates lower RNA quality.

○ If the sequences are exactly the same, they are considered PCR duplicates. If they are just similar, they are considered biological duplicates.

○ In paired-end experiments, duplicates occur at the paired-end level.

○ Generally, DNA-seq involves removing duplicates, while RNA-seq does not: In RNA-seq, the same sequence may repeatedly appear not only due to technical duplicates but also due to high expressing transcripts or short genes. Removing these biological duplicates can reduce the dynamic range of the data or decrease statistical power.

○ The likelihood of increased duplication rate may rise with the number of cycles during the PCR process: By checking the correlation between the duplicate rate and the number of PCR cycles, one can distinguish between technical and biological duplicates.

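○ Tools for marking or removing duplicates: Samtools (markdup), Picard (MarkDuplicates), fastp. Trimmomatic and Trim Galore! are trimming tools and do not remove duplicates themselves.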
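A naive duplication rate can be estimated by counting exact sequence repeats. This is only a sketch: the function name `duplication_rate` is hypothetical, and exact-match counting cannot separate PCR duplicates from biological duplicates, which is why UMIs or the PCR-cycle comparison above are needed for that distinction.

```python
from collections import Counter


def duplication_rate(reads):
    """Return the fraction of reads that are exact repeats of an earlier read.

    For paired-end data, pass (read1, read2) tuples so that duplicates
    are evaluated at the pair level.
    """
    if not reads:
        return 0.0
    distinct = len(Counter(reads))
    return 1.0 - distinct / len(reads)


# Hypothetical single-end reads: 3 distinct sequences among 6 reads
single_end = ["ACGT", "ACGT", "ACGT", "TTGA", "CCAT", "CCAT"]
print(duplication_rate(single_end))

# Hypothetical paired-end reads: same read 1, but different mates are not duplicates
paired_end = [("ACGT", "GGTA"), ("ACGT", "GGTA"), ("ACGT", "TTTT")]
print(duplication_rate(paired_end))
```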

○ Unique molecular identifier (UMI)

○ RNA quality is significantly lower if the proportion of unique molecules is below 10%

○ Sequencing depth

○ # of reads

○ In cases of alternative splicing or allele-specific expression: >50 million reads are recommended

○ DEG analysis of a ribo-depleted library: Approximately 50-60 million total reads are recommended

○ A ribo-depleted library is recommended to have twice the sequencing depth of a poly-A selected library because it captures a wider variety of RNA (e.g., tRNA, other non-polyadenylated RNA, immature RNA) that the poly-A selected library cannot capture.

○ Sequencing read length

○ If fragments are shorter than the read length, the read runs into the adaptor sequence: Adaptor trimming is performed in this case

Figure 1. The reason why adaptors are read when fragments are short

○ Advantages of long-read sequencing compared to short-read sequencing: lower cost per nucleotide, more accurate mapping, ability to identify splice junctions, capability to detect allele-specific expression, ability to resolve repetitive sequences.

○ Disadvantages of long-read sequencing compared to short-read sequencing: higher overall cost, higher cost per read, requires more adaptor trimming. Long-read sequencing (PacBio, Oxford Nanopore) is more likely to contain adapter sequences due to its sequencing method, which involves repetitive reading and single-molecule sequencing.

○ Adaptor sequence

○ Issue 1: Since adapter sequences are artificial, they can cause alignment and variant calling to fail or introduce biases.

○ Issue 2: Adapter sequences contain identical sequences, leading to biases in coverage analysis and differential gene expression (DEG) analysis.

○ To remove adapter sequences, tools such as AdapterRemoval, Cutadapt, Trimmomatic, and bbduk.sh are used.
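The idea behind 3' adapter trimming can be sketched as follows. This is a deliberately simplified illustration, not how Cutadapt or Trimmomatic are implemented: it uses exact matching only (real tools tolerate mismatches and use quality scores), and the function name `trim_adapter` and the `min_overlap` default are assumptions.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter from a read using exact matching.

    Case 1: the full adapter occurs inside the read (short insert).
    Case 2: only a prefix of the adapter is present at the 3' end of the read.
    """
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Try the longest partial overlap first, down to min_overlap bases
    for overlap in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read


adapter = "AGATCGGAAGAGC"  # common Illumina adapter prefix, used here as an example
print(trim_adapter("ACGTACGTAGATCGGAAGAGC", adapter))  # full adapter present
print(trim_adapter("ACGTACGTAGATC", adapter))          # partial adapter at the 3' end
print(trim_adapter("ACGTACGT", adapter))               # no adapter
```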

○ Exonic ratio

○ For poly-A(+) RNA-seq, exonic region comprises 50 ~ 70% of reads

○ For rRNA(-) RNA-seq, the proportion of exonic reads is reduced

○ Paired-end vs Single-end

○ Single end reads: Each library fragment is sequenced only from a single end.

○ Paired end reads: Each library fragment is sequenced from both ends.

○ Paired-end (PE) reads are more accurate than single-end (SE) reads but are approximately twice as expensive.

○ If the goal is simply to calculate gene counts for DEG analysis, SE is sufficient.

○ SE is recommended when RNA is significantly degraded.

○ It’s better to avoid using PE on short fragments, because both reads then cover the same nucleotides, which is inefficient.

Figure 2. Inefficiencies that can occur in paired-end (PE) sequencing

○ Mate pair reads

○ A longer segment of DNA is circularized and reads from the joint region are sequenced (both ends).

○ Strand-specific (ssRNA-seq)

○ Can be either forward or reverse.

○ Poly-A selection vs Ribo-depletion

○ Advantages of a ribo-depletion library

○ Works even with RNA degradation: In degraded RNA, fragments are uneven and short, so poly-A selection becomes highly biased toward the 3’ end and less accurate.

○ Suitable for studying non-coding RNA.

○ Disadvantages of a ribo-depletion library

○ Expensive.

○ Includes a large number of meaningless reads.

② QC Metrics for ChIP-seq

○ Mapping ratio

○ Read depth: ENCODE recommends ≥10 million uniquely mapped reads for TFs (histone modifications).

○ Library complexity

○ Background uniformity (biasedness)

○ GC summit bias

○ qPCR enrichment

○ Fragment size distribution

○ Input DNA quality via NanoDrop

○ Cross-correlation analysis: NSC (normalized strand cross-correlation coefficient), RSC (relative strand cross-correlation coefficient)

○ FRiP (fraction of reads in peaks) (ref1, ref2), RUP (reads under peaks): Proportion of reads in a ChIP-seq dataset that fall into a peak. ENCODE recommends FRiP (RUP) ≥ 1%.

○ SPOT (signal portion of tags): A high value indicates a good signal-to-noise ratio.

○ IDR (irreproducible discovery rate) (ref1, ref2)

○ denQCi, simQCi, QC-STAMP (ref)

○ Motif analysis: What % of peaks contain the TF motif, and does the motif tend to occur in the middle of the peak? Not expected for all peaks, because a TF may bind as part of a protein complex or a heterodimer.
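FRiP, listed among the ChIP-seq metrics above, is a simple ratio once reads and peaks are available. The sketch below assumes each read is reduced to a single (chromosome, position) point and that peaks are half-open intervals; the function name `frip` and the toy coordinates are hypothetical, and the linear scan is only suitable for small examples.

```python
def frip(read_positions, peaks):
    """Fraction of reads in peaks.

    read_positions: list of (chrom, pos)
    peaks: list of (chrom, start, end), half-open intervals [start, end)
    """
    if not read_positions:
        return 0.0
    peaks_by_chrom = {}
    for chrom, start, end in peaks:
        peaks_by_chrom.setdefault(chrom, []).append((start, end))
    in_peaks = 0
    for chrom, pos in read_positions:
        if any(start <= pos < end for start, end in peaks_by_chrom.get(chrom, [])):
            in_peaks += 1
    return in_peaks / len(read_positions)


# Toy data
peaks = [("chr1", 100, 200), ("chr2", 50, 80)]
reads = [("chr1", 150), ("chr1", 250), ("chr2", 60), ("chr2", 80), ("chr3", 10)]
value = frip(reads, peaks)
print(value, "passes the 1% guideline" if value >= 0.01 else "below the 1% guideline")
```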

③ QC Metrics for ATAC-seq

○ FastQC: For example, “Per base sequence content” can be used to assess the integration bias of Tn5 transposase.

○ The ataqv package provides 35 QC metrics as follows (ref): fragment length distribution, % reads that are high-quality and autosomal, % reads properly paired end mapped, % reads that aligned to autosomes that were duplicates, short-to-mononucleosomal-ratio, TSS enrichment, duplicate fraction in peaks, duplicate fraction outside of peaks, peak duplicate ratio, cumulative fraction of high-quality autosomal reads in peaks, cumulative fraction of the genome that falls within peaks, distribution of mapping qualities, number of total reads, % alignments marked secondary, % alignments marked supplementary, % alignments marked as duplicates, mean mapping quality, median mapping quality, % reads unmapped, % reads with an unmapped mate, % QC-fail reads, % unpaired reads, % reads with mapping quality 0, % reads that paired and mapped but in RF orientation, % reads that paired and mapped but in FF orientation, % reads that paired and mapped but in RR orientation, % reads that paired and mapped but on separate chromosomes, % reads that paired and mapped but too far from mate, % reads that paired and mapped but not properly, % reads that aligned to autosomes, % reads that aligned to mitochondria, % reads that aligned to mitochondria that were duplicate, number of peaks called, fragment length distribution distance, max fraction of reads from a single autosome

④ Method 1. Other Datasets: 10x Genomics, GEO, ZENODO, etc.

⑤ Method 2. FastQC

○ 2-1. FastQC and MultiQC: Most popular

○ Base pair quality of reads

○ Adaptor sequences in reads

○ PCR duplicates

○ Overrepresented sequences

○ GC distribution for each sample

○ 2-2. QoRTs (ref1, ref2): Very good

○ RNA degradation: Distribution of reads 5’ → 3’

○ Strandedness check

○ GC bias

○ 2-3. RNASeQC: Decent

○ 2-4. RSeQC: Used to have major bugs

○ 2-5. Use conda Fastqc command (Linux)

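○ 2-6. Download SRA (Sequence Read Archive) Toolkit and use fastqc command (Linux)

○ Below is an example of the commands. FastQC generates an HTML report and a ZIP archive for each input file.

# Install FastQC (Debian/Ubuntu)
sudo apt install fastqc
# Move into the SRA Toolkit binaries directory
cd sratoolkit.3.0.5-ubuntu64/
cd bin
# Run FastQC on a FASTQ file located in this directory;
# it writes DRR016938_fastqc.html and DRR016938_fastqc.zip
fastqc DRR016938.fastq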
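To make the "base pair quality of reads" metric concrete, the sketch below computes the mean Phred score at each read position, which is the quantity summarized in FastQC's per-base quality plot. It is not FastQC's implementation: it assumes a plain 4-line-per-record FASTQ with Phred+33 encoding, and the function names and the in-memory example are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        seq = handle.readline().rstrip()
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual


def mean_quality_per_position(records, offset=33):
    """Return the mean Phred score at each read position (Phred+33 assumed)."""
    totals, counts = [], []
    for _, _, qual in records:
        for i, ch in enumerate(qual):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# In-memory example; for a real file use open("DRR016938.fastq") instead
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACG\n+\nI5!\n")
print(mean_quality_per_position(read_fastq(example)))
```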

⑥ Method 3. Trimmomatic: Takes Fastq files as input.

⑦ Method 4. FASTX-Toolkit: Takes Fastq files as input.

⑧ Method 5. QC after mapping: Takes SAM or BAM files as input.

○ QC metric

○ % uniquely mapped reads

○ % reads mapping to exons

○ Complexity, i.e. x% of read counts being taken up by y% of genes

○ Consistency across samples

○ Sample swap: Match Y chromosome, Xist, Genotype (e.g., SNP) with metadata.

○ 5-1. Qplot

○ 5-2. Samtools
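The complexity metric mentioned above ("x% of read counts being taken up by y% of genes") can be computed from a vector of per-gene counts. This is a minimal sketch; the function name `top_gene_read_share` and the toy counts are hypothetical, and what counts as "too concentrated" depends on the tissue and library type.

```python
def top_gene_read_share(counts, top_percent=1):
    """Return the fraction of all reads assigned to the top `top_percent`% of genes."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # Number of genes in the top group, rounded up, at least one gene
    n_top = max(1, (len(counts) * top_percent + 99) // 100)
    top = sorted(counts, reverse=True)[:n_top]
    return sum(top) / total


# Toy counts for 10 genes: one gene dominates the library
counts = [900, 50, 20, 10, 10, 5, 3, 1, 1, 0]
print(top_gene_read_share(counts, top_percent=10))  # share taken by the top 10% of genes
print(top_gene_read_share(counts, top_percent=20))  # share taken by the top 20% of genes
```

A library in which a handful of genes absorb most reads has low complexity, leaving little depth for the remaining genes.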

⑨ Method 6. Snakemake: Integrated pipeline providing QC functionality as well

○ Snakefile: A Snakemake script based on Python. The filename itself is Snakefile.

# Snakefile

# Define output file
rule all:
    input:
        "results/processed_data.tsv"

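# Rule for data processing.
# NOTE: the awk program is a placeholder that copies every line unchanged;
# replace it with the actual filter. Braces are doubled ({{ }}) so that
# Snakemake does not treat them as wildcards.
rule process_data:
    input:
        "data/raw_data.tsv"
    output:
        "results/processed_data.tsv"
    shell:
        """
        awk -F'\t' '{{ print }}' {input} > {output}
        """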

○ config.yaml (optional): Setting for Snakemake workflow (ref)

○ requirements.txt (optional): Package dependency

○ Input file

○ Output file

⑩ Method 7. QuASAR-QC: Applicable for Hi-C data

⑪ Troubleshooting

⑶ Type 2. Rank correlation between samples: Used for internal validity confirmation

① Objective 1. Evaluating sample quality by checking whether two variables that are expected to agree actually agree within a single sample

○ Example 1. Examining the alignment of expression levels of two genes known to be similar

○ Example 2. Investigating whether expression of two genes known to be similar appears in the same cluster

② Objective 2. Mainly used to observe correspondence between a pair of identical samples

③ Objective 3. Examining correlation coefficients of two different variables with different data distribution characteristics

○ Somewhat distant from QC analysis

○ Example: Investigating the correlation coefficient between gene A expression in scRNA-seq and gene A expression in ST

④ Method 1. Pearson Correlation Coefficient

○ Definition: Given the standard deviations σx, σy of X and Y, ρ(X, Y) = Cov(X, Y) / (σx σy)

○ Calculation in RStudio

○ cor(x, y)

○ cor(x, y, method = "pearson")

○ cor.test(x, y)

○ cor.test(x, y, method = "pearson")
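For readers who prefer to see the arithmetic behind `cor(x, y)`, the Pearson coefficient can be written out in a few lines of Python. This is an illustrative sketch (the function name `pearson` is hypothetical); the n or n − 1 normalization cancels between covariance and standard deviations, so raw sums are used.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(ss_x * ss_y)


print(pearson([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))   # perfectly linear, increasing
print(pearson([1, 2, 3, 4, 5], [10, 8, 6, 4, 2]))   # perfectly linear, decreasing
```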

⑤ Method 2. Spearman’s Rank Correlation Coefficient

○ Definition: Defined based on ranks x’ = rank(x) and y’ = rank(y), as the Pearson correlation coefficient of the ranks: ρ = Cov(x’, y’) / (σx’ σy’)

○ Calculation in RStudio

○ cor(x, y, method = "spearman")

○ cor.test(x, y, method = "spearman")
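Spearman's coefficient is the Pearson coefficient applied to ranks, which the following sketch makes explicit. The helper names are hypothetical, and tied values receive their average rank, which is one common convention.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # monotone but non-linear in x
print(spearman(x, y), pearson(x, y))
```

Because y increases monotonically with x, the rank correlation is 1 even though the linear (Pearson) correlation is below 1.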

⑥ Method 3. Kendall’s Rank Correlation Coefficient

○ Definition: Defined based on concordant pairs and discordant pairs

○ Step 1. Sort the (x, y) pairs in ascending order of x: Represent the y value of the i-th pair as yi

○ Step 2. Count the number of concordant pairs where yj > yi (where j > i) for each yi value

○ Step 3. Count the number of discordant pairs where yj < yi (where j > i) for each yi value

○ Step 4. Definition of correlation coefficient: τ = (n_c − n_d) / (n(n − 1) / 2)

○ n_c: total number of concordant pairs

○ n_d: total number of discordant pairs

○ n: size of x and y

○ Calculation in RStudio

○ cor(x, y, method = "kendall")

○ cor.test(x, y, method = "kendall")
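The four steps above can be sketched directly in Python. This is a minimal O(n²) illustration that assumes no tied values in x or y (the function name `kendall_tau` is hypothetical); when ties are present, statistical packages typically apply a tie-corrected variant, so results may differ.

```python
def kendall_tau(x, y):
    """Kendall's tau from concordant and discordant pair counts (no ties assumed)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    # Step 1: order the y values by ascending x
    ys = [b for _, b in sorted(zip(x, y))]
    n_c = n_d = 0
    for i in range(n):
        for j in range(i + 1, n):
            if ys[j] > ys[i]:    # Step 2: concordant pair
                n_c += 1
            elif ys[j] < ys[i]:  # Step 3: discordant pair
                n_d += 1
    # Step 4: normalize by the total number of pairs
    return (n_c - n_d) / (n * (n - 1) / 2)


# 7 concordant and 3 discordant pairs out of 10
print(kendall_tau([1, 2, 3, 4, 5], [3, 1, 2, 5, 4]))
```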

⑦ Method 4. Q-Q plot between empirical CDFs (Cumulative Distribution Functions)

⑧ Method 5. Q-Q plot between ordered p-values

⑨ For Hi-C sequencing, available methods include HiCRep, GenomeDISCO, HiC-Spector, and QuASAR-Rep.

3. Troubleshooting

⑴ Method 1. Website Investigation: Pre-flight errors, in-flight errors, or alerts

① Failing to install bcl2fastq

② ATAC Sequencing depth per cell is low ( Cell Ranger ARC v2.0 ): Ideal > 10,000. Low ATAC sequencing depth negatively impacts the quality of peak calling, clustering, differential analysis and feature linkage. At very low sequencing depth, < 5000 raw read-pairs per cell, identification of cell barcodes may be unreliable.

③ GEX Sequencing depth per cell is low ( Cell Ranger ARC v2.0 ): Ideal > 5,000. Low GEX sequencing depth negatively impacts the quality of clustering, differential analysis and feature linkage. At very low sequencing depth, < 2,000 raw read-pairs per cell, identification of cell barcodes may be unreliable.

④ ATAC Median fragments per cell is low ( Cell Ranger ARC v2.0 ): A low value is generally caused by low sequence depth, the wrong genome reference, or low library complexity that could be due to a problem during the transposition step or a problem in the library preparation workflow. Low fragment counts negatively impact clustering, differential analysis and feature linkage detection.

⑤ Number of linkages detected is low ( Cell Ranger ARC v2.0 ): The number of detected feature linkage is < 100. This may be caused by a low number of nuclei recovered, low sequencing depth, poor peak calling, or a sample that is relatively homogenous.

⑥ GEX Median UMI counts per cell is low ( Cell Ranger ARC v2.0 ): Observed value < 100. This may be a consequence of very low sequencing depth, poor sample quality, an error in the library preparation workflow, the wrong reference genome, or poor genome annotations. Low UMI counts negatively impact clustering, differential analysis and feature linkage detection.

⑦ GEX Reads mapping to reference is low ( Cell Ranger ARC v2.0 ): Ideal > 80%. This can be caused by the wrong reference genome being used or a poor quality genome assembly. Application performance may be affected.

⑧ GEX Reads mapping to transcriptome is low ( Cell Ranger ARC v2.0 ): Ideal > 50%. This can indicate use of the wrong reference transcriptome, a reference transcriptome with overlapping genes, poor library quality, poor sequencing quality, or reads shorter than the recommended minimum. Application performance may be affected.

⑨ ATAC Reads mapping to reference is low ( Cell Ranger ARC v2.0 ): Ideal > 80%. This can be caused by the wrong reference genome being used or a poor quality genome assembly. Application performance may be affected.

⑩ GEX Transcriptome reads in cells is low ( Cell Ranger ARC v2.0 ): Ideal > 60%. Many of the reads were not assigned to cell-associated barcodes. This is generally indicative of poor sample prep resulting in high levels of ambient RNA. It could also indicate a problem in the cell calling algorithm that could be caused by high RNA or DNA background, exclusion of a large number of barcodes from cell calling due to low targeting, or due to a population of nuclei with low RNA content. The latter case can be addressed by inspecting the data to determine the appropriate cell count and rerunning the pipeline supplying appropriate parameters to override the cell caller. Application performance may be affected.

⑪ Low Fraction Reads Confidently Mapped To Transcriptome ( Cell Ranger v6.1 ): Ideal > 30%. This can indicate use of the wrong reference transcriptome, a reference transcriptome with overlapping genes, poor library quality, poor sequencing quality, or reads shorter than the recommended minimum. Application performance may be affected.

⑫ No Cells Detected ( Cell Ranger v6.1 ): Estimated number of cells is expected to be > 100. This usually indicates poor cell handling, poor library, or poor sequencing quality. Application performance is likely to be affected.

⑬ Low Fraction Valid UMIs ( Cell Ranger v6.1 ): Ideal > 75%. This may indicate a quality issue with the Illumina R2 read for Single Cell 3’ v1 or the R1 read for Single Cell 3’ v2/v3 and Single Cell 5’. Application performance may be affected.

⑭ Fraction of UMI bases with Q-score >= 30 is low ( Cell Ranger v6.1 ): Fraction of UMI bases (Illumina R2 Read for Single Cell 3’ v1, R1 for Single Cell 3’ v2/v3 and Single Cell 5’) with Q-score >= 30 should be above 75%. A lower fraction might indicate poor sequencing quality.

⑮ Fraction of cell barcode bases with Q-score >= 30 is low ( Cell Ranger v6.1 ): Fraction of cell barcode bases (Illumina I7 Read for Single Cell 3’ v1, R1 for Single Cell 3’ v2/v3 and Single Cell 5’) with Q-score >= 30 should be above 55%. A lower fraction might indicate poor sequencing quality.

ⓐ Too many detected cells ( Cell Ranger ATAC v2.0 ): Estimated number of cells is expected to be under 10,000. A high value might indicate an overlapping of cells, a problem during library preparation, or unexpected behavior in the cell calling algorithm.

ⓑ Average fraction of barcode bases with high sequencing quality is low ( Cell Ranger ATAC v2.0 ): Average fraction of bases in barcode with quality above Q30 should be ideally above 75%. A lower fraction might indicate poor sequencing quality.

ⓒ Median fragments per cell is low ( Cell Ranger ATAC v2.0 ): The median number of fragments (that passed all filters) detected in single cells is expected to be above 500. A lower value suggests low sensitivity, potentially due to insufficient sequencing.

ⓓ The percentage of transposition events falling within peaks is low ( Cell Ranger ATAC v2.0 ): It is expected that more than 25% of the transposition events fall within peak regions. A lower value could suggest peak undercalling or low sequencing depth.

ⓔ Estimated number of cells is low ( Cell Ranger ATAC v2.0 ): Number of cells detected is expected to be higher than 500. This usually indicates poor cell, library, or sequencing quality.

ⓕ Average fraction of barcode bases with high sequencing quality is low ( Cell Ranger ATAC v2.0 ): Average fraction of bases in barcode with quality above Q30 should be above 75%. A lower fraction might indicate poor sequencing quality.

ⓖ Fraction of RNA read bases with Q-score >= 30 is low ( Space Ranger v1.3 ): Fraction of RNA read bases with Q-score >= 30 should be above 80%. A lower fraction might indicate poor sequencing quality.

ⓗ Low Fraction Reads in Spots ( Space Ranger v1.3 ): Ideal > 50%. Application performance may be affected. Many of the reads were not assigned to tissue covered spots. This could be caused by high levels of ambient RNA resulting from inefficient permeabilization or because of poor tissue detection. The latter case can be addressed by using the manual tissue selection option through Loupe.
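Several of the alerts above reduce to comparing a summary metric against an "ideal" lower bound, which can be automated when screening many runs. The sketch below copies a few Cell Ranger ARC v2.0 thresholds from the list above; the dictionary keys are hypothetical names invented for this example and do not correspond to actual column names in any 10x Genomics output file.

```python
# Lower bounds taken from the Cell Ranger ARC v2.0 alerts listed above.
# The metric names are hypothetical and chosen only for this sketch.
ARC_THRESHOLDS = {
    "atac_raw_read_pairs_per_cell": 10_000,
    "gex_raw_read_pairs_per_cell": 5_000,
    "gex_reads_mapped_to_reference_pct": 80,
    "gex_reads_mapped_to_transcriptome_pct": 50,
    "atac_reads_mapped_to_reference_pct": 80,
    "gex_transcriptome_reads_in_cells_pct": 60,
}


def flag_low_metrics(metrics, thresholds=ARC_THRESHOLDS):
    """Return the sorted names of metrics that do not exceed their ideal lower bound."""
    return sorted(
        name
        for name, minimum in thresholds.items()
        if name in metrics and metrics[name] <= minimum
    )


# Hypothetical run summary
run = {
    "atac_raw_read_pairs_per_cell": 12_000,
    "gex_raw_read_pairs_per_cell": 3_000,
    "gex_reads_mapped_to_reference_pct": 85,
    "gex_reads_mapped_to_transcriptome_pct": 45,
}
print(flag_low_metrics(run))
```

Metrics missing from the run summary are skipped rather than flagged, so the same function can be reused with partial reports.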

⑵ Method 2. Search for Technical Notes

① Single Cell Gene Expression Assay

② Single Cell Multiome ATAC + Gene Expression Assay

③ Single Cell ATAC Assay

④ Visium Assay

⑤ Visium Assay2

Input: 2023.05.22 11:48
