RNA Sequencing Quality Control (QC)

Recommended Article : 【Bioinformatics】 Table of Contents for Bioinformatics Analysis


1. Tissue QC

2. Data QC

3. Troubleshooting


a. Genome Projects and Sequencing Techniques

b. Transcriptome Analysis Pipeline



1. Tissue QC (quality control)

⑴ Definition : Measure of tissue quality

Type 1. RIN (RNA Integrity Number)

① Measured by Agilent 2100 Bioanalyzer

② RIN = 10 : Intact RNA

③ RIN = 1 : Completely degraded RNA

④ RIN > 7 : Generally suitable quality for RNA-seq

Type 2. DV200 : For FFPE tissues, measures the percentage of RNA fragments longer than 200 nt, since RNA in FFPE tissues is fragmented
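
The two tissue QC measures above can be illustrated with a minimal Python sketch. It assumes a plain list of fragment lengths is available (real instruments derive DV200 from the electropherogram trace), and the function names and example values are hypothetical.

```python
def dv200(fragment_lengths):
    """Percentage of RNA fragments longer than 200 nt."""
    if not fragment_lengths:
        raise ValueError("no fragments given")
    longer = sum(1 for length in fragment_lengths if length > 200)
    return 100.0 * longer / len(fragment_lengths)


def rin_suitable_for_rnaseq(rin, threshold=7.0):
    """Apply the RIN > 7 rule of thumb for RNA-seq."""
    return rin > threshold


# Hypothetical fragment lengths (nt) for an FFPE-like sample
lengths = [80, 150, 190, 210, 350, 500, 120, 260, 90, 400]
print(f"DV200 = {dv200(lengths):.1f}%")   # 5 of 10 fragments exceed 200 nt
print(rin_suitable_for_rnaseq(8.2))
```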



2. Data QC (quality control)

⑴ Definition : Evaluation of data quality and improvement as necessary

Type 1. Data QC Metrics : Used for external validity confirmation by comparing with manuals or other datasets

① QC Metrics

○ In dUTP method, “_1.fastq” represents the first strand (anti-sense), and “_2.fastq” represents the second strand (sense; original RNA sequence)

○ High non-coding RNA ratio and high duplication indicate lower RNA quality

○ High GC content : Indicates potential rRNA contamination. In this case, filter out 5S, 18S, 28S rRNA

○ Low GC content : Indicates potential issues with reverse transcription

○ Typically, duplicates are removed in DNA-seq, but not in RNA-seq

○ RNA quality is significantly lower if the proportion of unique molecules is below 10%

○ If fragments are shorter than the read length, reads run into the adaptor sequence : Adaptor trimming is performed in this case

○ For poly-A(+) RNA-seq, exonic region comprises 50 ~ 70% of reads

○ For rRNA(-) RNA-seq, the proportion of exonic reads is reduced
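
As a rough illustration of two of the metrics above (GC content and the proportion of unique reads), here is a minimal Python sketch. It assumes uncompressed 4-line FASTQ records and approximates "unique molecules" by distinct read sequences; the toy reads and function names are hypothetical.

```python
def read_fastq_sequences(fastq_text):
    """Return the sequence line of every 4-line FASTQ record."""
    lines = fastq_text.strip().splitlines()
    return [lines[i].strip().upper() for i in range(1, len(lines), 4)]


def gc_fraction(sequences):
    """Fraction of G/C bases among all called (non-N) bases."""
    gc = sum(seq.count("G") + seq.count("C") for seq in sequences)
    called = sum(len(seq) - seq.count("N") for seq in sequences)
    return gc / called if called else 0.0


def unique_fraction(sequences):
    """Fraction of reads with a distinct sequence (1 - duplication rate)."""
    return len(set(sequences)) / len(sequences) if sequences else 0.0


# Hypothetical toy FASTQ with four reads, two of them identical
toy_fastq = """@r1
GGCCAT
+
IIIIII
@r2
GGCCAT
+
IIIIII
@r3
ATATGC
+
IIIIII
@r4
TTTTAA
+
IIIIII
"""

seqs = read_fastq_sequences(toy_fastq)
print(f"GC fraction: {gc_fraction(seqs):.2f}")
print(f"Unique fraction: {unique_fraction(seqs):.2f}")
```

In practice, tools such as FastQC report both values; the sketch only shows what is being counted.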

② QC Metrics for ChIP-seq

○ Mapping ratio

○ Read depth

○ Library complexity

○ Background uniformity (biasedness)

○ GC summit bias

○ qPCR enrichment

○ Fragment size distribution

○ Input DNA quality via NanoDrop

○ Cross-correlation analysis : NSC (normalized strand cross-correlation coefficient), RSC (relative strand cross-correlation coefficient)

○ FRiP (fraction of reads in peaks) (ref1, ref2)

○ IDR (irreproducible discovery rate) (ref1, ref2)

○ denQCi, simQCi, QC-STAMP (ref)
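
Of the ChIP-seq metrics above, FRiP is simple enough to sketch directly. The Python example below assumes each read is reduced to a single (chromosome, position) point and that peaks are half-open intervals; the data and the function name are hypothetical, and real pipelines usually count overlaps from BAM and BED files instead.

```python
def frip(read_positions, peaks):
    """Fraction of reads whose position falls inside any peak.

    read_positions: list of (chrom, position) tuples, one per read
    peaks: list of (chrom, start, end) half-open intervals
    """
    if not read_positions:
        return 0.0
    # Group peak intervals by chromosome for lookup
    peaks_by_chrom = {}
    for chrom, start, end in peaks:
        peaks_by_chrom.setdefault(chrom, []).append((start, end))
    in_peaks = 0
    for chrom, pos in read_positions:
        intervals = peaks_by_chrom.get(chrom, [])
        if any(start <= pos < end for start, end in intervals):
            in_peaks += 1
    return in_peaks / len(read_positions)


# Hypothetical reads and peaks: 3 of the 5 reads fall inside a peak
reads = [("chr1", 105), ("chr1", 150), ("chr1", 900), ("chr2", 40), ("chr2", 75)]
peaks = [("chr1", 100, 200), ("chr2", 50, 100)]
print(frip(reads, peaks))
```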

Method 1. Other Datasets : 10x Genomics, GEO, ZENODO, etc.

Method 2. Use the FastQC program

Method 3. Install FastQC with conda and run the fastqc command (Linux)

Method 4. Download the SRA (Sequence Read Archive) Toolkit and use the fastqc command (Linux) : Below is an example of the commands


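# Install FastQC (Debian/Ubuntu)
sudo apt install fastqc
# Move into the SRA Toolkit binaries directory
cd sratoolkit.3.0.5-ubuntu64/
cd bin
# Run FastQC on a FASTQ file already present in this directory;
# writes DRR016938_fastqc.html and DRR016938_fastqc.zip
fastqc DRR016938.fastq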
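
The zip archive written by FastQC contains a summary.txt file with one tab-separated line per module (status, module name, file name). A minimal Python sketch for listing the modules that did not pass could look like this; the statuses in the example are invented for illustration and the function names are hypothetical.

```python
def parse_fastqc_summary(summary_text):
    """Map each FastQC module name to its PASS/WARN/FAIL status."""
    statuses = {}
    for line in summary_text.splitlines():
        if not line.strip():
            continue
        status, module, _filename = line.split("\t")
        statuses[module] = status
    return statuses


def flagged_modules(statuses):
    """Modules that did not PASS, in file order."""
    return [module for module, status in statuses.items() if status != "PASS"]


# Hypothetical summary.txt content (statuses invented for illustration)
example = (
    "PASS\tBasic Statistics\tDRR016938.fastq\n"
    "PASS\tPer base sequence quality\tDRR016938.fastq\n"
    "WARN\tPer sequence GC content\tDRR016938.fastq\n"
    "FAIL\tSequence Duplication Levels\tDRR016938.fastq\n"
    "PASS\tAdapter Content\tDRR016938.fastq\n"
)
statuses = parse_fastqc_summary(example)
print(flagged_modules(statuses))
```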


Method 5. Snakemake : Workflow manager for integrated pipelines, which can include QC steps as well


# Snakefile

# Define output file
rule all:
    input:
        "results/processed_data.tsv"

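# Rule for data processing
# (put the awk program between the empty quotes; literal awk braces
# must be doubled, e.g. {{ ... }}, inside a Snakemake shell string)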
rule process_data:
    input:
        "data/raw_data.tsv"
    output:
        "results/processed_data.tsv"
    shell:
        """
        cat {input} | awk -F'\t' '' > {output}
        """


config.yaml (optional): Settings for the Snakemake workflow (ref)

requirements.txt (optional): Package dependencies

○ Input file

○ Output file

Method 6. QuASAR-QC: Applicable for Hi-C data

Troubleshooting

Type 2. Rank-correlation between samples : Used for internal validity confirmation

Objective 1. Evaluating sample quality by checking, within a single sample, whether two variables that are expected to agree actually do

Example 1. Checking whether the expression levels of two genes known to behave similarly agree

Example 2. Investigating whether expression of two genes known to be similar appears in the same cluster

Objective 2. Mainly used to observe correspondence between a pair of identical samples

Objective 3. Examining correlation coefficients of two different variables with different data distribution characteristics

○ Somewhat distant from QC analysis

○ Example : Investigating the correlation coefficient between gene A expression in scRNA-seq and gene A expression in ST

Method 1. Pearson Correlation Coefficient

○ Definition : Given the standard deviations σx, σy of X and Y,


$$r = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y} = \frac{E[(X - E[X])(Y - E[Y])]}{\sigma_x \sigma_y}$$


Calculation in RStudio

○ cor(x, y)

○ cor(x, y, method = "pearson")

○ cor.test(x, y)

○ cor.test(x, y, method = "pearson")
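
For readers who want to see the computation spelled out, here is a minimal Python sketch of the covariance-over-standard-deviations definition. It is an illustration only, not the implementation behind R's cor(); the function name and example vectors are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the SDs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/n (or 1/(n-1)) factors cancel between numerator and denominator
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (sd_x * sd_y)


x = [1, 2, 3, 4, 5]
print(round(pearson(x, [2, 4, 6, 8, 10]), 6))   # perfectly linear, increasing
print(round(pearson(x, [5, 4, 3, 2, 1]), 6))    # perfectly linear, decreasing
```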

Method 2. Spearman’s Rank Correlation Coefficient

○ Definition : Defined based on ranks x’ = rank(x) and y’ = rank(y)


$$\rho = \frac{\mathrm{Cov}(x', y')}{\sigma_{x'} \sigma_{y'}}$$


Calculation in RStudio

○ cor(x, y, method = "spearman")

○ cor.test(x, y, method = "spearman")
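
The rank-based definition can be sketched in Python as the Pearson coefficient applied to ranks. This is a minimal illustration that assigns tied values the mean of their rank positions; the function names and example vectors are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


x = [10, 20, 30, 40, 50]
y = [1, 4, 9, 16, 100]   # monotonic in x but far from linear
print(round(spearman(x, y), 6))
print(round(pearson(x, y), 6))
```

Because y increases whenever x does, the rank correlation is 1 even though the Pearson coefficient on the raw values is lower.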

Method 3. Kendall’s Rank Correlation Coefficient

○ Definition : Defined based on concordant pairs and discordant pairs

Step 1. Sort the (x, y) pairs in ascending order of x : Denote the y values in that order as yi

Step 2. Count the number of concordant pairs where yj > yi (where j > i) for each yi value

Step 3. Count the number of discordant pairs where yj < yi (where j > i) for each yi value

Step 4. Definition of correlation coefficient


$$\tau = \frac{n_c - n_d}{n(n-1)/2}$$


○ nc : total number of concordant pairs

○ nd : total number of discordant pairs

○ n : size of x and y

Calculation in RStudio

○ cor(x, y, method = "kendall")

○ cor.test(x, y, method = "kendall")
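
Steps 1 to 4 above can be followed literally in a short Python sketch. It assumes there are no tied values; with ties, library implementations that apply a tie correction may return a different number. The function name and example vectors are hypothetical.

```python
def kendall_tau(x, y):
    """Kendall's tau from concordant and discordant pair counts (no ties)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    # Step 1: order the y values by ascending x
    ys = [b for _, b in sorted(zip(x, y))]
    n = len(ys)
    concordant = 0
    discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if ys[j] > ys[i]:      # Step 2: concordant pair
                concordant += 1
            elif ys[j] < ys[i]:    # Step 3: discordant pair
                discordant += 1
    # Step 4: normalise by the total number of pairs, n(n-1)/2
    return (concordant - discordant) / (n * (n - 1) / 2)


x = [1, 2, 3, 4, 5]
y = [3, 1, 4, 2, 5]
print(kendall_tau(x, y))   # 7 concordant and 3 discordant pairs out of 10
```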



3. Troubleshooting

Method 1. Website Investigation : Pre-flight errors, in-flight errors, or alerts

Failing to install bcl2fastq

ATAC Sequencing depth per cell is low ( Cell Ranger ARC v2.0 ) : Ideal > 10,000. Low ATAC sequencing depth negatively impacts the quality of peak calling, clustering, differential analysis and feature linkage. At very low sequencing depth, < 5000 raw read-pairs per cell, identification of cell barcodes may be unreliable.

GEX Sequencing depth per cell is low ( Cell Ranger ARC v2.0 ) : Ideal > 5,000. Low GEX sequencing depth negatively impacts the quality of clustering, differential analysis and feature linkage. At very low sequencing depth, < 2,000 raw read-pairs per cell, identification of cell barcodes may be unreliable.

ATAC Median fragments per cell is low ( Cell Ranger ARC v2.0 ) : A low value is generally caused by low sequence depth, the wrong genome reference, or low library complexity that could be due to a problem during the transposition step or a problem in the library preparation workflow. Low fragment counts negatively impact clustering, differential analysis and feature linkage detection.

Number of linkages detected is low ( Cell Ranger ARC v2.0 ) : The number of detected feature linkages is < 100. This may be caused by a low number of nuclei recovered, low sequencing depth, poor peak calling, or a sample that is relatively homogeneous.

GEX Median UMI counts per cell is low ( Cell Ranger ARC v2.0 ) : Observed value < 100. This may be a consequence of very low sequencing depth, poor sample quality, an error in the library preparation workflow, the wrong reference genome, or poor genome annotations. Low UMI counts negatively impact clustering, differential analysis and feature linkage detection.

GEX Reads mapping to reference is low ( Cell Ranger ARC v2.0 ) : Ideal > 80%. This can be caused by the wrong reference genome being used or a poor quality genome assembly. Application performance may be affected.

GEX Reads mapping to transcriptome is low ( Cell Ranger ARC v2.0 ) : Ideal > 50%. This can indicate use of the wrong reference transcriptome, a reference transcriptome with overlapping genes, poor library quality, poor sequencing quality, or reads shorter than the recommended minimum. Application performance may be affected.

ATAC Reads mapping to reference is low ( Cell Ranger ARC v2.0 ) : Ideal > 80%. This can be caused by the wrong reference genome being used or a poor quality genome assembly. Application performance may be affected.

GEX Transcriptome reads in cells is low ( Cell Ranger ARC v2.0 ) : Ideal > 60%. Many of the reads were not assigned to cell-associated barcodes. This is generally indicative of poor sample prep resulting in high levels of ambient RNA. It could also indicate a problem in the cell calling algorithm that could be caused by high RNA or DNA background, exclusion of a large number of barcodes from cell calling due to low targeting, or due to a population of nuclei with low RNA content. The latter case can be addressed by inspecting the data to determine the appropriate cell count and rerunning the pipeline supplying appropriate parameters to override the cell caller. Application performance may be affected.

Low Fraction Reads Confidently Mapped To Transcriptome ( Cell Ranger v6.1 ) : Ideal > 30%. This can indicate use of the wrong reference transcriptome, a reference transcriptome with overlapping genes, poor library quality, poor sequencing quality, or reads shorter than the recommended minimum. Application performance may be affected.

No Cells Detected ( Cell Ranger v6.1 ) : Estimated number of cells is expected to be > 100. This usually indicates poor cell handling, poor library, or poor sequencing quality. Application performance is likely to be affected.

Low Fraction Valid UMIs ( Cell Ranger v6.1 ) : Ideal > 75%. This may indicate a quality issue with the Illumina R2 read for Single Cell 3’ v1 or the R1 read for Single Cell 3’ v2/v3 and Single Cell 5’. Application performance may be affected.

Fraction of UMI bases with Q-score >= 30 is low ( Cell Ranger v6.1 ) : Fraction of UMI bases (Illumina R2 Read for Single Cell 3’ v1, R1 for Single Cell 3’ v2/v3 and Single Cell 5’) with Q-score >= 30 should be above 75%. A lower fraction might indicate poor sequencing quality.

Fraction of cell barcode bases with Q-score >= 30 is low ( Cell Ranger v6.1 ) : Fraction of cell barcode bases (Illumina I7 Read for Single Cell 3’ v1, R1 for Single Cell 3’ v2/v3 and Single Cell 5’) with Q-score >= 30 should be above 55%. A lower fraction might indicate poor sequencing quality.

Too many detected cells ( Cell Ranger ATAC v2.0 ) : Estimated number of cells is expected to be under 10,000. A high value might indicate an overlapping of cells, a problem during library preparation, or unexpected behavior in the cell calling algorithm.

Average fraction of barcode bases with high sequencing quality is low ( Cell Ranger ATAC v2.0 ) : Average fraction of bases in barcode with quality above Q30 should be ideally above 75%. A lower fraction might indicate poor sequencing quality.

Median fragments per cell is low ( Cell Ranger ATAC v2.0 ) : The median number of fragments (that passed all filters) detected in single cells is expected to be above 500. A lower value suggests low sensitivity, potentially due to insufficient sequencing.

The percentage of transposition events falling within peaks is low ( Cell Ranger ATAC v2.0 ) : It is expected that more than 25% of the transposition events fall within peak regions. A lower value could suggest peak undercalling or low sequencing depth.

Estimated number of cells is low ( Cell Ranger ATAC v2.0 ) : Number of cells detected is expected to be higher than 500. This usually indicates poor cell, library, or sequencing quality.

Fraction of RNA read bases with Q-score >= 30 is low ( Space Ranger v1.3 ) : Fraction of RNA read bases with Q-score >= 30 should be above 80%. A lower fraction might indicate poor sequencing quality.

Low Fraction Reads in Spots ( Space Ranger v1.3 ) : Ideal > 50%. Application performance may be affected. Many of the reads were not assigned to tissue covered spots. This could be caused by high levels of ambient RNA resulting from inefficient permeabilization or because of poor tissue detection. The latter case can be addressed by using the manual tissue selection option through Loupe.
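
Alerts of this kind are threshold checks, so they can be reproduced on any metrics table. The Python sketch below uses the Cell Ranger v6.1 minimum values quoted above; the metric key names and the example values are hypothetical and do not correspond to Cell Ranger's own output fields.

```python
# Minimum values taken from the Cell Ranger v6.1 alerts listed above;
# the metric key names are hypothetical, not Cell Ranger's own field names.
THRESHOLDS = {
    "fraction_reads_confidently_mapped_to_transcriptome": 0.30,
    "estimated_number_of_cells": 100,
    "fraction_valid_umis": 0.75,
    "fraction_umi_bases_q30": 0.75,
    "fraction_barcode_bases_q30": 0.55,
}


def low_metrics(metrics, thresholds=THRESHOLDS):
    """Return the names of metrics that are at or below their minimum."""
    return [
        name
        for name, minimum in thresholds.items()
        if name in metrics and metrics[name] <= minimum
    ]


# Hypothetical metrics for one sample
sample = {
    "fraction_reads_confidently_mapped_to_transcriptome": 0.22,
    "estimated_number_of_cells": 3500,
    "fraction_valid_umis": 0.91,
    "fraction_umi_bases_q30": 0.70,
    "fraction_barcode_bases_q30": 0.93,
}
for name in low_metrics(sample):
    print(f"ALERT: {name} is low ({sample[name]})")
```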

Method 2. Search for technical notes

Single Cell Gene Expression Assay

Single Cell Multiome ATAC + Gene Expression Assay

Single Cell ATAC Assay

Visium Assay

Visium Assay2



Input : 2023.05.22 11:48
