RNA Sequencing Quality Control (QC)

Recommended Article: 【Bioinformatics】 Table of Contents for Bioinformatics Analysis


1. Tissue QC

2. Data QC

3. Troubleshooting


a. Genome Projects and Sequencing Techniques

b. Transcriptome Analysis Pipeline



1. Tissue QC (sample level quality control)

⑴ Definition: Measure of tissue quality

Type 1. RIN (RNA Integrity Number)

① Background Knowledge: mRNA comprises less than 3% of total RNA. rRNA makes up more than 80% (mainly 28S [~5 kb] and 18S [~2 kb] in eukaryotic cells).

② Measured by the Agilent 2100 Bioanalyzer.

③ RIN Algorithm: It uses features such as the ratio of 18S to total RNA, the ratio of 28S to total RNA, and the 18S normalized height.

④ RIN = 10: Intact RNA.

⑤ RIN = 1: Completely degraded RNA.

⑥ RIN > 7: Generally considered suitable quality level for RNA-seq.

Type 2. DV200: The percentage of RNA fragments longer than 200 nt. Used mainly for FFPE tissues, where RNA is heavily fragmented and RIN is less informative.
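As a rough illustration of the DV200 idea, the sketch below computes the share of signal that comes from fragments longer than 200 nt. The function name `dv200` and the input layout (parallel lists of fragment sizes and signal intensities, as might be exported from an electropherogram) are assumptions for this example; real instruments integrate the trace area rather than summing discrete points.

```python
def dv200(sizes_nt, signal):
    """Percentage of total signal from fragments longer than 200 nt.

    sizes_nt and signal are hypothetical parallel lists: fragment size
    in nucleotides and the signal intensity measured at that size.
    """
    if len(sizes_nt) != len(signal):
        raise ValueError("sizes_nt and signal must have the same length")
    total = sum(signal)
    if total <= 0:
        raise ValueError("total signal must be positive")
    above = sum(s for size, s in zip(sizes_nt, signal) if size > 200)
    return 100.0 * above / total


# Toy trace: half of the signal sits above 200 nt.
sizes = [50, 100, 150, 250, 500, 1000]
signal = [10, 20, 20, 20, 20, 10]
print(f"DV200 = {dv200(sizes, signal):.1f}%")
```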

Type 3. Nucleic Acid Purity Quantification via Absorbance Ratio

① 260 nm / 280 nm Ratio

○ Pure DNA: ~1.8

○ Pure RNA: ~2.0

○ A low 260 nm / 280 nm ratio suggests the presence of proteins or phenol, which absorb at 280 nm.

② 260 nm / 230 nm Ratio

○ Pure DNA/RNA molecules: ~2-2.2

○ A low 260 nm / 230 nm ratio indicates the presence of other contaminants that absorb near 230 nm (e.g., guanidine salts, phenol, carbohydrates).
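The two absorbance ratios above can be checked with a few lines of code. This is a minimal sketch: the function names and the cutoffs (1.8 and 2.0 for an RNA sample) are assumptions chosen to match the approximate values listed above, not vendor specifications.

```python
def purity_ratios(a230, a260, a280):
    """Return (A260/A280, A260/A230) from raw absorbance readings."""
    return a260 / a280, a260 / a230


def flag_rna_purity(a230, a260, a280, min_260_280=1.8, min_260_230=2.0):
    """List possible contamination warnings for an RNA sample.

    The default cutoffs are illustrative assumptions.
    """
    r280, r230 = purity_ratios(a230, a260, a280)
    warnings = []
    if r280 < min_260_280:
        warnings.append("low A260/A280: possible protein or phenol")
    if r230 < min_260_230:
        warnings.append("low A260/A230: possible other contaminants")
    return warnings


print(flag_rna_purity(a230=0.50, a260=1.00, a280=0.50))
print(flag_rna_purity(a230=0.80, a260=1.00, a280=0.65))
```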



2. Data QC (sequence level quality control)

⑴ Definition: Evaluation of data quality and improvement as necessary

Type 1. Data QC Metrics: Used for external validity confirmation by comparing with manuals or other datasets

① QC Metrics

○ In dUTP method, “_1.fastq” represents the first strand (anti-sense), and “_2.fastq” represents the second strand (sense; original RNA sequence)

base quality

mapping rate

Non-coding RNA ratio

○ High non-coding RNA ratio indicates lower RNA quality.

GC content

○ High GC content: Indicates potential rRNA contamination. In this case, filter out 5S, 18S, 28S rRNA

○ Low GC content: Indicates potential issues with reverse transcription
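GC content is simple to compute directly from read sequences. The sketch below is a simplified stand-in for what QC tools report; the helper names `gc_fraction` and `mean_gc` are hypothetical, and ambiguous bases (N) are ignored by assumption.

```python
def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases (N is ignored)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else 0.0


def mean_gc(reads):
    """Mean per-read GC fraction across a collection of reads."""
    reads = list(reads)
    if not reads:
        return 0.0
    return sum(gc_fraction(r) for r in reads) / len(reads)


print(mean_gc(["GGCC", "AATT", "ATGC"]))
```

Comparing this value between samples, or against the expected value for the organism, is one way to spot the high or low GC shifts described above.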

Read duplicate

○ PCR duplicate: A duplicate that is merely a replication of the same nucleic acid molecule during PCR. High duplication indicates lower RNA quality.

○ If the sequences are exactly the same, they are considered PCR duplicates. If they are just similar, they are considered biological duplicates.

○ In paired-end experiments, duplicates occur at the paired-end level.

○ Generally, DNA-seq involves removing duplicates, while RNA-seq does not: In RNA-seq, the same sequence may repeatedly appear not only due to technical duplicates but also due to high expressing transcripts or short genes. Removing these biological duplicates can reduce the dynamic range of the data or decrease statistical power.

○ The duplication rate tends to rise with the number of PCR cycles: By checking the correlation between the duplicate rate and the number of PCR cycles, one can distinguish between technical and biological duplicates.

○ Tools for removing duplicates: Samtools, Picard
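To make the exact-sequence definition of a duplicate concrete, here is a small sketch that estimates a duplication rate from raw reads. It is only an illustration of the definition used above: tools such as Samtools and Picard work on aligned reads and mapping coordinates, and the function name `duplication_rate` is hypothetical.

```python
from collections import Counter


def duplication_rate(reads):
    """Fraction of reads that are exact copies of another read.

    For paired-end data, pass (read1, read2) tuples so that duplicates
    are counted at the pair level.
    """
    reads = list(reads)
    if not reads:
        return 0.0
    unique = len(Counter(reads))
    return (len(reads) - unique) / len(reads)


single_end = ["ACGT", "ACGT", "ACGT", "TTGA", "CCGA"]
paired_end = [("ACGT", "TTAA"), ("ACGT", "TTAA"), ("ACGT", "GGCC")]
print(duplication_rate(single_end))
print(duplication_rate(paired_end))
```

In the paired-end toy data, the third pair shares its first mate with the others but is not a duplicate, because the second mate differs.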

Unique molecule

○ RNA quality is significantly lower if the proportion of unique molecules is below 10%

Sequencing depth

○ # of reads

○ In cases of alternative splicing or allele-specific expression: >50 million reads are recommended

○ DEG analysis of a ribo-depleted library: Approximately 50-60 million total reads are recommended

○ A ribo-depleted library is recommended to have twice the sequencing depth compared to a poly-A selected library because the ribo-depleted library can capture a wider variety of RNA (e.g., non-polyadenylated non-coding RNA, immature RNA) that the poly-A selected library cannot capture.

Sequencing read length

○ If fragments are shorter than the read length, the read runs into the adaptor sequence: Adaptor trimming is performed in this case

○ Advantages of long-read sequencing compared to short-read sequencing: lower cost per nucleotide, more accurate mapping, ability to identify splice junctions, capability to detect allele-specific expression, ability to resolve repetitive sequences.

○ Disadvantages of long-read sequencing compared to short-read sequencing: higher overall cost, higher cost per read, requires more adaptor trimming.

Exonic ratio

○ For poly-A(+) RNA-seq, exonic region comprises 50 ~ 70% of reads

○ For rRNA(-) RNA-seq, the proportion of exonic reads is reduced

Paired-end vs Single-end

○ Paired-end (PE) reads are more accurate than single-end (SE) reads but are approximately twice as expensive.

○ If the goal is simply to calculate gene counts for DEG analysis, SE is sufficient.

○ SE is recommended when RNA is significantly degraded.

○ It’s better to avoid using PE on short fragments to prevent inefficiencies caused by sequencing the same nucleotide.


Figure 1. Inefficiencies that can occur in paired-end (PE) sequencing
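The inefficiency of paired-end sequencing on short fragments can be quantified with simple arithmetic. The sketch below assumes both mates have the same read length and ignores adaptor read-through; the function names are hypothetical.

```python
def pe_overlap(fragment_len, read_len):
    """Number of bases sequenced by both mates of a pair."""
    covered_by_each = min(read_len, fragment_len)
    return max(0, 2 * covered_by_each - fragment_len)


def redundant_fraction(fragment_len, read_len):
    """Share of sequenced fragment bases that were read twice."""
    sequenced = 2 * min(read_len, fragment_len)
    if sequenced == 0:
        return 0.0
    return pe_overlap(fragment_len, read_len) / sequenced


for fragment in (400, 200, 100):
    print(fragment, pe_overlap(fragment, 150), redundant_fraction(fragment, 150))
```

With 150 nt reads, a 400 nt fragment has no overlap, while a 100 nt fragment is covered fully by each mate, so half of the sequenced bases are redundant.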


Poly-A selection vs Ribo-depletion

○ Advantages of a ribo-depletion library

○ Works even with degraded RNA: When RNA is degraded, fragments are uneven and short, and poly-A selection becomes highly biased toward the 3’ end, making it less accurate.

○ Suitable for studying non-coding RNA.

○ Disadvantages of a ribo-depletion library

○ Expensive.

○ Includes a large number of meaningless reads.

② QC Metrics for ChIP-seq

Mapping ratio

Read depth

Library complexity

Background uniformity (biasedness)

GC summit bias

○ qPCR enrichment

Fragment size distribution

○ Input DNA quality via NanoDrop

Cross-correlation analysis: NSC (normalized strand cross-correlation coefficient), RSC (relative strand cross-correlation coefficient)

○ FRiP (fraction of reads in peaks) (ref1, ref2)

○ IDR (irreproducible discovery rate) (ref1, ref2)

○ denQCi, simQCi, QC-STAMP (ref)

Method 1. Other Datasets: 10x Genomics, GEO, ZENODO, etc.

Method 2. FastQC

2-1. Use FastQC program

2-2. Use the fastqc command installed via conda (Linux)

2-3. Download SRA (Sequence Read Archive) Toolkit and use fastqc command (Linux)

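○ Below is an example of the commands.


# Install FastQC
sudo apt install fastqc
# Move into the SRA Toolkit binaries and download the run as FASTQ
cd sratoolkit.3.0.5-ubuntu64/
cd bin
./fastq-dump DRR016938
# Run FastQC on the downloaded file; it writes an HTML report and a zip archive
fastqc DRR016938.fastq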
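For intuition about the base quality metric that FastQC summarizes, the following sketch computes the mean Phred score of each read in a FASTQ file. It assumes the common Phred+33 encoding and strict four-line records with non-empty quality strings; the function name `mean_phred_per_read` is hypothetical, and this is not how FastQC itself is implemented.

```python
import io


def mean_phred_per_read(handle, offset=33):
    """Yield (read_id, mean Phred score) for each 4-line FASTQ record."""
    while True:
        header = handle.readline().strip()
        if not header:
            break
        handle.readline()  # sequence line (not needed here)
        handle.readline()  # '+' separator line
        quality = handle.readline().strip()
        scores = [ord(ch) - offset for ch in quality]
        yield header[1:], sum(scores) / len(scores)


# Two toy records: 'I' encodes Q40 and '!' encodes Q0 in Phred+33.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!II\n")
for read_id, mean_q in mean_phred_per_read(example):
    print(read_id, mean_q)
```

For a real file, the same function could be called with `open("DRR016938.fastq")` in place of the in-memory example.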


Method 3. Trimmomatic: Takes Fastq files as input.

Method 4. FASTX-Toolkit: Takes Fastq files as input.

Method 5. QC after mapping: Takes SAM or BAM files as input.

○ QC metric

○ % uniquely mapped reads

○ % reads mapping to exons

○ Complexity, i.e. x% of read counts being taken up by y% of genes

○ Consistency across samples

○ Sample swap: Match Y chromosome, Xist, Genotype (e.g., SNP) with metadata.

5-1. Qplot

5-2. Samtools
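The complexity metric listed above (x% of read counts being taken up by y% of genes) can be sketched as follows. The function name `top_gene_share` and the toy counts are assumptions for illustration; in practice the counts would come from a gene-level count table produced after mapping.

```python
def top_gene_share(counts, top_fraction=0.01):
    """Share of all reads taken up by the top `top_fraction` of genes."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    # Always include at least one gene so that small tables still work.
    n_top = max(1, round(len(ordered) * top_fraction))
    return sum(ordered[:n_top]) / total


# Toy example: one of ten genes dominates the library.
counts = [900, 50, 20, 10, 10, 5, 3, 1, 1, 0]
print(top_gene_share(counts, top_fraction=0.1))
```

A very high share concentrated in a handful of genes suggests a low-complexity library, which is worth comparing across samples.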

Method 7. Snakemake: Integrated pipeline providing QC functionality as well

Snakefile: A Snakemake script based on Python. The filename itself is Snakefile.


# Snakefile

# Define output file
rule all:
    input:
        "results/processed_data.tsv"

# Rule for data processing 
rule process_data:
    input:
        "data/raw_data.tsv"
    output:
        "results/processed_data.tsv"
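    # The awk program was empty in the original; '{{ print $0 }}' is a
    # placeholder that copies each row unchanged. Braces are doubled so
    # that Snakemake does not interpret them as placeholders.
    shell:
        """
        cat {input} | awk -F'\t' '{{ print $0 }}' > {output}
        """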


config.yaml (optional): Setting for Snakemake workflow (ref)

requirements.txt (optional): Package dependency

○ Input file

○ Output file

Method 8. QuASAR-QC: Applicable for Hi-C data

Troubleshootings

Type 2. Rank-correlation between samples: Used for internal validity confirmation

Objective 1. Evaluating sample quality by checking, within a single sample, whether two variables that are expected to agree actually do

Example 1. Examining the alignment of expression levels of two genes known to be similar

Example 2. Investigating whether expression of two genes known to be similar appears in the same cluster

Objective 2. Mainly used to observe correspondence between a pair of identical samples

Objective 3. Examining correlation coefficients of two different variables with different data distribution characteristics

○ Somewhat distant from QC analysis

○ Example: Investigating the correlation coefficient between gene A expression in scRNA-seq and gene A expression in ST

Method 1. Pearson Correlation Coefficient

○ Definition: Given the standard deviations σx, σy of X and Y,


$$r = \frac{\mathrm{cov}(X, Y)}{\sigma_x \sigma_y}$$


Calculation in RStudio

○ cor(x, y)

○ cor(x, y, method = "pearson")

○ cor.test(x, y)

○ cor.test(x, y, method = "pearson")
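For readers who want to see the arithmetic behind the Pearson coefficient, here is a minimal pure-Python sketch of covariance divided by the product of standard deviations. The function name `pearson` is hypothetical; the result should agree with R's `cor(x, y)` up to floating-point error.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance over the product of std. deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/(n-1) factors of covariance and both standard deviations cancel,
    # so sums of squares are used directly.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))
```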

Method 2. Spearman’s Rank Correlation Coefficient

○ Definition: Defined based on ranks x’ = rank(x) and y’ = rank(y)


$$r_s = \frac{\mathrm{cov}(x', y')}{\sigma_{x'} \sigma_{y'}}$$


Calculation in RStudio

○ cor(x, y, method = "spearman")

○ cor.test(x, y, method = "spearman")
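Spearman's coefficient can be sketched as the Pearson correlation of the ranks. The helper names below are hypothetical, and tied values are given their average rank by assumption, which mirrors the default behavior of R's `rank()`.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    ss_x = sum((a - mx) ** 2 for a in rx)
    ss_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(ss_x * ss_y)


# A monotonic but non-linear relationship still gives a perfect rank correlation.
print(spearman([1, 2, 3, 4], [1, 4, 9, 16]))
```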

Method 3. Kendall’s Rank Correlation Coefficient

○ Definition: Defined based on concordant pairs and discordant pairs

Step 1. Sort the (x, y) pairs in ascending order of x: Represent the y values in this order as yi

Step 2. Count the number of concordant pairs where yj > yi (where j > i) for each yi value

Step 3. Count the number of discordant pairs where yj < yi (where j > i) for each yi value

Step 4. Definition of correlation coefficient


$$\tau = \frac{n_c - n_d}{n(n-1)/2}$$


○ nc: total number of concordant pairs

○ nd: total number of discordant pairs

○ n: size of x and y

Calculation in RStudio

○ cor(x, y, method = "kendall")

○ cor.test(x, y, method = "kendall")
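The four steps of Kendall's coefficient translate almost line by line into code. This sketch assumes there are no tied values in x or y (the simple form of the coefficient); the function name `kendall_tau` is hypothetical, and with ties the value may differ from what R's `cor(..., method = "kendall")` returns.

```python
def kendall_tau(x, y):
    """Kendall's rank correlation for data without ties."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    # Step 1: order the y values by ascending x.
    ys = [b for _, b in sorted(zip(x, y))]
    nc = nd = 0
    for i in range(n):
        for j in range(i + 1, n):
            if ys[j] > ys[i]:
                nc += 1  # Step 2: concordant pair
            elif ys[j] < ys[i]:
                nd += 1  # Step 3: discordant pair
    # Step 4: normalize by the total number of pairs, n(n-1)/2.
    return (nc - nd) / (n * (n - 1) / 2)


# Five concordant pairs and one discordant pair out of six.
print(kendall_tau([1, 2, 3, 4], [1, 3, 2, 4]))
```

The double loop is quadratic in n, which is fine for a handful of samples but slow for very long vectors.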



3. Troubleshooting

Method 1. Website Investigation: Pre-flight errors, in-flight errors, or alerts

Failing to install bcl2fastq

ATAC Sequencing depth per cell is low ( Cell Ranger ARC v2.0 ): Ideal > 10,000. Low ATAC sequencing depth negatively impacts the quality of peak calling, clustering, differential analysis and feature linkage. At very low sequencing depth, < 5000 raw read-pairs per cell, identification of cell barcodes may be unreliable.

GEX Sequencing depth per cell is low ( Cell Ranger ARC v2.0 ): Ideal > 5,000. Low GEX sequencing depth negatively impacts the quality of clustering, differential analysis and feature linkage. At very low sequencing depth, < 2,000 raw read-pairs per cell, identification of cell barcodes may be unreliable.

ATAC Median fragments per cell is low ( Cell Ranger ARC v2.0 ): A low value is generally caused by low sequence depth, the wrong genome reference, or low library complexity that could be due to a problem during the transposition step or a problem in the library preparation workflow. Low fragment counts negatively impact clustering, differential analysis and feature linkage detection.

Number of linkages detected is low ( Cell Ranger ARC v2.0 ): The number of detected feature linkage is < 100. This may be caused by a low number of nuclei recovered, low sequencing depth, poor peak calling, or a sample that is relatively homogenous.

GEX Median UMI counts per cell is low ( Cell Ranger ARC v2.0 ): Observed value < 100. This may be a consequence of very low sequencing depth, poor sample quality, an error in the library preparation workflow, the wrong reference genome, or poor genome annotations. Low UMI counts negatively impact clustering, differential analysis and feature linkage detection.

GEX Reads mapping to reference is low ( Cell Ranger ARC v2.0 ): Ideal > 80%. This can be caused by the wrong reference genome being used or a poor quality genome assembly. Application performance may be affected.

GEX Reads mapping to transcriptome is low ( Cell Ranger ARC v2.0 ): Ideal > 50%. This can indicate use of the wrong reference transcriptome, a reference transcriptome with overlapping genes, poor library quality, poor sequencing quality, or reads shorter than the recommended minimum. Application performance may be affected.

ATAC Reads mapping to reference is low ( Cell Ranger ARC v2.0 ): Ideal > 80%. This can be caused by the wrong reference genome being used or a poor quality genome assembly. Application performance may be affected.

GEX Transcriptome reads in cells is low ( Cell Ranger ARC v2.0 ): Ideal > 60%. Many of the reads were not assigned to cell-associated barcodes. This is generally indicative of poor sample prep resulting in high levels of ambient RNA. It could also indicate a problem in the cell calling algorithm that could be caused by high RNA or DNA background, exclusion of a large number of barcodes from cell calling due to low targeting, or due to a population of nuclei with low RNA content. The latter case can be addressed by inspecting the data to determine the appropriate cell count and rerunning the pipeline supplying appropriate parameters to override the cell caller. Application performance may be affected.

Low Fraction Reads Confidently Mapped To Transcriptome ( Cell Ranger v6.1 ): Ideal > 30%. This can indicate use of the wrong reference transcriptome, a reference transcriptome with overlapping genes, poor library quality, poor sequencing quality, or reads shorter than the recommended minimum. Application performance may be affected.

No Cells Detected ( Cell Ranger v6.1 ): Estimated number of cells is expected to be > 100. This usually indicates poor cell handling, poor library, or poor sequencing quality. Application performance is likely to be affected.

Low Fraction Valid UMIs ( Cell Ranger v6.1 ): Ideal > 75%. This may indicate a quality issue with the Illumina R2 read for Single Cell 3’ v1 or the R1 read for Single Cell 3’ v2/v3 and Single Cell 5’. Application performance may be affected.

Fraction of UMI bases with Q-score >= 30 is low ( Cell Ranger v6.1 ): Fraction of UMI bases (Illumina R2 Read for Single Cell 3’ v1, R1 for Single Cell 3’ v2/v3 and Single Cell 5’) with Q-score >= 30 should be above 75%. A lower fraction might indicate poor sequencing quality.

Fraction of cell barcode bases with Q-score >= 30 is low ( Cell Ranger v6.1 ): Fraction of cell barcode bases (Illumina I7 Read for Single Cell 3’ v1, R1 for Single Cell 3’ v2/v3 and Single Cell 5’) with Q-score >= 30 should be above 55%. A lower fraction might indicate poor sequencing quality.

Too many detected cells ( Cell Ranger ATAC v2.0 ): Estimated number of cells is expected to be under 10,000. A high value might indicate an overlapping of cells, a problem during library preparation, or unexpected behavior in the cell calling algorithm.

Average fraction of barcode bases with high sequencing quality is low ( Cell Ranger ATAC v2.0 ): Average fraction of bases in barcode with quality above Q30 should be ideally above 75%. A lower fraction might indicate poor sequencing quality.

Median fragments per cell is low ( Cell Ranger ATAC v2.0 ): The median number of fragments (that passed all filters) detected in single cells is expected to be above 500. A lower value suggests low sensitivity, potentially due to insufficient sequencing.

The percentage of transposition events falling within peaks is low ( Cell Ranger ATAC v2.0 ): It is expected that more than 25% of the transposition events fall within peak regions. A lower value could suggest peak undercalling or low sequencing depth.

Estimated number of cells is low ( Cell Ranger ATAC v2.0 ): Number of cells detected is expected to be higher than 500. This usually indicates poor cell, library, or sequencing quality.

Fraction of RNA read bases with Q-score >= 30 is low ( Space Ranger v1.3 ): Fraction of RNA read bases with Q-score >= 30 should be above 80%. A lower fraction might indicate poor sequencing quality.

Low Fraction Reads in Spots ( Space Ranger v1.3 ): Ideal > 50%. Application performance may be affected. Many of the reads were not assigned to tissue covered spots. This could be caused by high levels of ambient RNA resulting from inefficient permeabilization or because of poor tissue detection. The latter case can be addressed by using the manual tissue selection option through Loupe.
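When many samples are processed, alerts like those above can be screened programmatically. The sketch below compares observed percentages against the "ideal" values quoted for Cell Ranger ARC v2.0. The dictionary keys are hypothetical labels invented for this example and do not correspond to actual column names in any 10x Genomics output file.

```python
# Ideal minimum percentages quoted above for Cell Ranger ARC v2.0.
# The keys are hypothetical labels, not real output column names.
ARC_V2_IDEAL_MIN_PCT = {
    "gex_reads_mapped_to_reference_pct": 80,
    "gex_reads_mapped_to_transcriptome_pct": 50,
    "atac_reads_mapped_to_reference_pct": 80,
    "gex_transcriptome_reads_in_cells_pct": 60,
}


def below_ideal(observed, ideal=ARC_V2_IDEAL_MIN_PCT):
    """Return the sorted names of metrics that do not exceed their ideal value."""
    return sorted(
        name
        for name, minimum in ideal.items()
        if name in observed and observed[name] <= minimum
    )


sample_metrics = {
    "gex_reads_mapped_to_reference_pct": 92.0,
    "gex_reads_mapped_to_transcriptome_pct": 41.5,
    "gex_transcriptome_reads_in_cells_pct": 73.2,
}
print(below_ideal(sample_metrics))
```

Metrics that are missing from the observed values are skipped rather than flagged, which is a design choice of this sketch.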

Method 2. Search for Technical Notes

Single Cell Gene Expression Assay

Single Cell Multiome ATAC + Gene Expression Assay

Single Cell ATAC Assay

Visium Assay

Visium Assay2



Input: 2023.05.22 11:48
