Chapter 10. Genome Project and Sequencing Technology
1. Genome Project
⑴ Overview
① Launched in 1990 under the direction of James Watson: Planned as a 15-year project by a consortium of 6 countries
② Collaborative research involving over 350 research institutions
○ Draft announced on June 11, 2000, at 84.5% completion
○ Final version with 99.99% accuracy released on April 15, 2003
○ Involved over 2,800 researchers over 13 years, at a cost of 2.7 trillion won
③ Spin-off effects of human genome research
○ Birth of bioinformatics
○ Promotion of the development of the human protein production process
○ Insulin: The first protein with a determined sequence
○ Promotion of the development of automated DNA sequencing devices
○ Promotion of the genome analysis of other applicable organisms
⑵ Methodology 1: Stepwise Sequencing Method (Scientists’ Approach)
① Step 1: Determining Restriction Enzyme Recognition Sites
○ Cutting DNA with restriction enzymes and electrophoresis reveals the sizes of each fragment
○ Digesting with two restriction enzymes, separately and together, reveals the relative distances between their recognition sites
② Step 2: Constructing a Gene Map
○ Determining the relative distances of genes on the chromosome
○ Inferring the distance between genes through recombination rates
③ Step 3: Physical Map (DNA Map) Construction
○ Role of the restriction enzyme recognition sites: Fragments with known sequences, anchored at each restriction enzyme recognition site, are assembled cumulatively into a physical map
○ Role of the gene map: Once a physical map is constructed, it can be compared with the gene map. Non-coding DNA lies between genes (introns lie within genes)
○ Approach using a single library
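Step 2 above infers gene distances from recombination rates. A minimal sketch of that arithmetic, assuming the usual convention that 1% recombination corresponds to 1 map unit (centimorgan) over short distances; the cross data and gene names are made up for illustration:

```python
def map_distance_cm(recombinants: int, total: int) -> float:
    """Recombination frequency expressed in map units (centimorgans)."""
    if total <= 0 or not 0 <= recombinants <= total:
        raise ValueError("need total > 0 and 0 <= recombinants <= total")
    return 100.0 * recombinants / total

# Hypothetical two-point crosses: (recombinant offspring, total offspring)
crosses = {("A", "B"): (50, 1000), ("B", "C"): (120, 1000), ("A", "C"): (170, 1000)}
distances = {pair: map_distance_cm(r, n) for pair, (r, n) in crosses.items()}
```

In this toy data the A-C distance equals the sum of A-B and B-C, which is consistent with the gene order A, B, C.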
⑶ Methodology 2: Shotgun Sequencing (Celera, J. Craig Venter) (Entrepreneurial Approach)
① Cutting copies of the same DNA in multiple ways
② Determining the sequence of the fragments produced by each cut
○ The length that can be read in one analysis is limited, so the whole DNA sequence cannot be determined all at once
③ Aligning the overlapping fragment sequences from the different cuts until a consistent overall sequence is obtained
④ Method based on computer science
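The computational step of shotgun sequencing can be illustrated with a toy greedy overlap assembler. This is only a sketch under simplifying assumptions (error-free reads, a single strand, no repeats); it is not the algorithm Celera used, and the function names are hypothetical:

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best = (0, -1, -1)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # nothing left to merge
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)] + [merged]
    return contigs

assembled = greedy_assemble(["ATGGCGT", "GCGTACG", "TACGTTA"])
```

Real assemblers must also handle sequencing errors, both strands, and repeats, which is why the method depends so heavily on computer science.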
⑷ Scientists’ Approach vs. Entrepreneurial Approach
① The scientists were dissatisfied with the entrepreneurs’ appropriation of their contributions and investments
② The entrepreneurs were dissatisfied with the scientists’ failure to disclose information, which led them to develop new methodologies
③ The final completion of the genome map is a joint effort of both.
Figure 1. Stepwise Sequencing Method and Shotgun Sequencing Method
2. Sequencing Technology
⑴ Overview
① DNA Sequencing: Applying the principle of DNA replication
○ Template: Each strand of DNA
○ Substrate: dNTP (dATP, dCTP, dGTP, dTTP)
○ NTP has an -OH group at the 2’ carbon and is a material for RNA synthesis
○ DNA polymerase: Phosphate of the next nucleotide binds to the 3’-OH of deoxyribose
○ Synthesis direction: 5’ → 3’, forming complementary base pairs with the template
○ ddNTP: DNA polymerization is terminated because the 3’ carbon lacks an -OH group
② RNA Sequencing: Applying the principle of RNA transcription
⑵ in vitro cloning: The very first sequencing method
⑶ Dideoxy Chain Termination Method (= Sanger sequencing): Reported in 1977, Sanger’s second Nobel Prize
① Substrate: dNTP + ddNTP (in small quantities) + buffer (pH stabilized)
○ ddNTP lacks a 3’-OH group, so it terminates the polymerization reaction
○ If ddNTP is added in large quantities, all template DNAs are quickly terminated
② Primer
○ Example: 32P-labeled primer (CTAG)
③ 1st. Addition of template DNA and polymerase
④ 2nd. Heating to separate the complementary strand after polymerization
⑤ 3rd. Electrophoresis followed by reading the sequence on X-ray film or fluorescence examination
Figure 2. Process of Dideoxy Chain Termination Method
⑥ Advantages: Can read very long strands, still used in laboratories
⑦ Disadvantages: Requires a large amount of the same DNA strand
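The logic of reading a sequence from chain-terminated fragments can be sketched as follows. This is an idealized model, assuming every possible termination product is present, and ignoring the primer length; the function name and the input sequence are illustrative only:

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def sanger_read(template_3to5: str) -> str:
    """Idealized dideoxy read of the strand synthesized on the template."""
    # The new strand grows 5'->3' and is complementary to the template.
    new_strand = "".join(COMPLEMENT[b] for b in template_3to5)
    # One reaction per ddNTP: each position holding that base yields a
    # terminated fragment whose length equals the position (1-based).
    lanes = {
        base: [i + 1 for i, b in enumerate(new_strand) if b == base]
        for base in "ACGT"
    }
    # Electrophoresis sorts fragments by length; reading the lanes from the
    # shortest fragment to the longest gives the synthesized sequence.
    by_length = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in by_length)

read = sanger_read("TACGGT")
```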
⑷ Dye-dideoxy chain termination method: Using laser
① Add a small amount of ddNTPs, each labeled with one of four fluorescent colors, to the dNTPs.
② Automatic DNA sequencing is possible.
Figure 3. Process of the Dye-dideoxy chain termination method
⑸ Pyrosequencing
① Definition: A DNA sequencing method that detects light emitted in proportion to the amount of pyrophosphate released during DNA synthesis.
② Diagram
Figure 4. Pyrosequencing diagram
③ Process
Figure 5. Pyrosequencing process
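Because the light signal is proportional to the pyrophosphate released, a run of identical bases gives a proportionally stronger signal. A minimal decoding sketch, assuming one nucleotide type is dispensed at a time and that signals are already scaled so that one incorporation gives about 1.0 (the dispensation order and signal values are invented):

```python
def decode_pyrogram(dispensation: str, signals: list[float], unit: float = 1.0) -> str:
    """Convert per-dispensation light signals into a base sequence.

    Each signal is assumed proportional to the number of identical
    nucleotides incorporated in that dispensation (homopolymer length).
    """
    if len(dispensation) != len(signals):
        raise ValueError("one signal per dispensed nucleotide is required")
    bases = []
    for nucleotide, signal in zip(dispensation, signals):
        count = round(signal / unit)  # 0 means no incorporation, no light
        bases.append(nucleotide * count)
    return "".join(bases)

sequence = decode_pyrogram("ACGTACGT", [1.0, 0.0, 2.1, 0.1, 0.9, 1.0, 0.0, 3.0])
```

Rounding becomes unreliable for long homopolymers, where adjacent signal levels are hard to tell apart.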
⑹ Illumina solid-phase amplification (ref)
Figure 6. Illumina solid-phase amplification
Figure 7. Fluorescent color distribution photo
① 1st. Fragmentation: Randomly cut the given DNA sample.
② 2nd. Gel-based size selection: Size of each DNA fragment can be limited if necessary.
③ 3rd. Adapter ligation: Attach adapters to both ends of every DNA fragment.
④ 4th. Amplification
○ 4th - 1st. Denature DNA into single-strands.
○ 4th - 2nd. Attach single-stranded DNA to the Illumina flow cell.
○ 4th - 3rd. Add enzymes to allow single-stranded DNA to form bridges on a solid-phase substrate.
○ 4th - 4th. Add primers, which bind to the single-stranded DNA bridges.
○ 4th - 5th. Add unlabeled nucleotides and induce DNA synthesis: Forms double-stranded DNA bridges.
○ 4th - 6th. Denature to turn double-stranded DNA bridges into anchored single-stranded DNA.
○ 4th - 7th. Repeat the above six steps to create anchored single-stranded clusters with the same base sequence.
○ Feature: Millions of anchored single-stranded clusters are formed.
⑤ 5th. Sequencing by synthesis (SBS)
○ 5th - 1st. Add four types of labeled reversible terminators (one label per base type), primers, and DNA polymerase.
○ 5th - 2nd. When a labeled reversible terminator forms a phosphodiester bond, fluorescence is emitted.
○ 5th - 3rd. Obtain a fluorescent color distribution image of each cluster.
○ 5th - 4th. Washing
○ 5th - 5th. Repeat the above four steps to determine the entire base sequence.
○ Type 1. Single-end sequencing (SES): Sequencing with only one adapter.
○ Type 2. Paired-end sequencing (PES): Sequencing with both adapters.
○ Initially, sequence with one adapter (Read1 acquisition), then sequence with the opposite adapter (Read2 acquisition).
○ Read1 and Read2 from the same DNA fragment can be easily matched since they come from the same cluster.
○ Advantages: Higher accuracy (due to Read1 and Read2 comparison), easy detection of DNA variations, easy analysis of repetitive sequences, and easy mapping between different species.
○ Disadvantages: Higher cost and more steps required than SES.
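The relationship between Read1 and Read2 in paired-end sequencing can be sketched in a few lines. This assumes an error-free fragment with adapters already trimmed; the fragment and the helper names are hypothetical:

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def paired_end_reads(fragment: str, read_len: int) -> tuple[str, str]:
    """Read1 starts at the 5' end of the top strand, Read2 at the 5' end of the bottom strand."""
    read1 = fragment[:read_len]
    read2 = reverse_complement(fragment)[:read_len]
    return read1, read2

fragment = "ATGCCGTAAGGA"
read1, read2 = paired_end_reads(fragment, 4)
```

Reverse-complementing Read2 recovers the far end of the fragment, which is why the pair brackets the insert and helps with repeats and structural variants.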
⑺ WGS (Whole Genome Sequencing)
① SNV, insertion, deletion, structural variant, CNV
② Sequencing depth > 30X
⑻ WES (Whole Exome Sequencing)
① Detects only SNVs, insertions, deletions, and SNPs in protein-coding genes
② Sequencing depth > 50X ~ 100X
③ Cost-effective
⑼ RNA-seq
① 1st. Microdissection: Separating specific tissues for RNA extraction.
○ LCM (Laser Capture Microdissection): Cutting specific tissues with a laser beam. Robust but labor-intensive.
○ TOMO-seq: Using cryosection and computer-based 3D sectioning. Not suitable for clinical purposes.
○ Transcriptome in vivo analysis
○ ProximID
○ STRP-seq
② 2nd. Attach oligo(dT), which recognizes the poly-A tail of RNA.
③ 3rd. Fragment RNA.
④ 4th. Attach primers to RNA.
⑤ 5th. First cDNA synthesis.
⑥ 6th. Second cDNA synthesis.
⑦ 7th. Process the 3’ and 5’ ends of RNA.
⑧ 8th. Ligate DNA sequencing adapters.
⑨ 9th. Amplify ligated fragments with PCR.
⑩ Application 1. dUTP method: A representative method for strand-specific sequencing.
○ Background: Used for studying biological functions based on RNA orientation (e.g., regulation of antisense miRNA).
○ Step 1. DNA-RNA hybrid: Synthesize cDNA (first or antisense strand) using dT primers and reverse transcriptase, targeting mRNA poly-A tails.
5’-//-U-//-AAAAAA-3’
3’-//-A-//-TTTTTT-5’
○ Step 2. ds cDNA: Using the first-strand cDNA as the template, synthesize the second (sense) strand with dUTP instead of dTTP.
3’-//-A-//-TTTTTT-5’
5’-//-U-//-AAAAAA-3’
○ Step 3. Ligated ds cDNA: Connect Y-adaptors to both ends of the ds cDNA.
○ Step 4. Treatment with UDG (uracil-DNA glycosylase) breaks down the second strand, which contains uracil.
○ Step 5. Amplify the remaining antisense strand (first strand) to create the library.
○ In the library raw data, “_1.fastq” represents the first strand, while “_2.fastq” represents the second strand.
○ Thus, _2.fastq represents the original RNA profile.
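The strand bookkeeping of the dUTP method can be traced with a toy model. This is a simplified sketch, assuming full-length strands without adapters and treating UDG as removing any strand that contains U; the function and key names are illustrative:

```python
def reverse_complement_dna(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def dutp_library(mrna: str) -> dict[str, str]:
    """Trace which strand survives and what each read of the pair looks like."""
    rna_as_dna = mrna.replace("U", "T")
    first_strand = reverse_complement_dna(rna_as_dna)   # antisense, made with dTTP
    second_strand = rna_as_dna.replace("T", "U")        # sense, made with dUTP
    # UDG step: strands containing uracil are degraded.
    survivors = [s for s in (first_strand, second_strand) if "U" not in s]
    return {
        "surviving_strand": survivors[0],
        "read1": first_strand,                           # content of _1.fastq
        "read2": reverse_complement_dna(first_strand),   # content of _2.fastq
    }

library = dutp_library("AUGGCCAAAAAA")
```

In this model Read2 has the same sequence as the original mRNA (with T in place of U), matching the statement that _2.fastq reflects the original RNA profile.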
⑽ Single-cell sequencing
① Types: scDNA-seq, scRNA-seq (2013 Technology of the Year), single-cell epigenetics sequencing
② Step 1. Isolation of single cells
○ Method 1. Simple isolation: Very early method.
○ Method 2. Based on FACS or LCM (laser capture microdissection)
○ Method 3. Acoustic separation
○ Separates single cells hydrodynamically, causing minimal impact on cells.
○ CyTOF (cytometry by time of flight) is a representative method.
○ Method 4. Immuno-magnetic separation
○ Attach antibody-coated magnetic beads to cells.
○ Can obtain a large number of cells.
○ Divided into cases with and without centrifugation requirements.
○ Droplet-based platforms and plate-based platforms yield different library sizes.
③ Step 2. Reverse transcription
④ Step 3. cDNA amplification
⑤ Step 4. Library construction: e.g., Drop-seq
⑥ Single-cell genomics (scDNA-seq) + Single-cell transcriptomics (scRNA-seq)
○ Allows understanding the relationship between genomic mutation patterns and gene expression in transcriptomes.
○ Technologies for separating DNA and RNA: G&T seq, SIDR-seq, DNTR-seq
⑾ Single Nucleus RNA Sequencing (snRNA-seq)
① Purpose 1. Muscle fibers are multinucleated cells that scRNA-seq does not capture, so they need to be analyzed at the nuclear level.
② Purpose 2. snRNA-seq captures a wider variety of RNA than scRNA-seq, including introns, pre-mRNA, and non-coding RNA.
③ Purpose 3. snRNA-seq primarily captures nuclear RNA, although small amounts of cytoplasmic RNA are also captured.
⑿ Spatial Sequencing (▶ Supplement)
Figure 8. Overview of spatial sequencing
Table 1. Comparison of different spatial transcriptomic technologies
① Type 1. Spatial genomics
○ Example 1. Tumor research: Tumors are heterogeneous.
○ Example 2. Spleen research: Mature immune cells have diverse genetic compositions.
② Type 2. Spatial transcriptomics: 2020 Technology of the Year
③ 2-1. Spot-based spatial transcriptomics: Many genes + few spots
Figure 9. Benchmark study of spot-based spatial transcriptomics
○ ST (Spatial Transcriptomics)
○ Barcoded oligos are randomly arranged on a functionalized surface, capturing mRNA released from the mounted tissues and/or cells.
○ 10X Visium
○ Principle: Attach spot-specific oligonucleotides to each spot to hybridize with tissue-derived RNA, obtaining spotwise transcriptomes.
○ Surface area: 6.5 mm × 6.5 mm
○ Thickness: 10 ~ 20 μm
○ Number of spots: Up to 4992 (in the Visium version preceding Visium HD)
○ Distance between spots: 100 μm
○ Diameter of spots: 55 μm
○ Sensitivity: 10,000 transcripts per spot
○ Type 1. Direct Visium (oligo-dT based method): Captures mRNAs with poly dT. Only applicable to FF (fresh-frozen) samples.
Figure 10. Principle of Visium FF
○ Type 2. probe-based Visium
○ Applicable to both FF (fresh-frozen) and FFPE (formalin-fixed paraffin-embedded) samples. FFPE samples in particular cannot undergo direct Visium because RNA degradation fragments the mRNA molecules into pieces.
○ RTL (probe-based RNA-templated ligation chemistry) is used: to identify a target mRNA, all three pairs of LHS and RHS probes must be ligated together. Each probe is 25 nucleotides long.
Figure 11. Principle of probe-based Visium
○ Advantage: Superior data quality compared to direct Visium.
○ Disadvantage: Limited freedom in analysis compared to Visium FF, as only genes specified by the probe are detected.
○ From June 2024, 10x discontinues the Visium FFPE service that does not use CytAssist.
○ The CytAssist images represent the distribution of gene expression and are used for image alignment.
○ 10X Visium HD
○ The basic data consists of spots with a diameter of 2 μm, and additional data binned at 8 μm and 16 μm are also provided.
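The 8 μm and 16 μm data can be thought of as sums over blocks of 2 μm squares. A minimal sketch of that aggregation, assuming counts are keyed by integer (row, column) indices of the 2 μm grid; this data layout is a hypothetical simplification and does not model the actual 10x output files:

```python
from collections import Counter

def bin_counts(counts_2um: dict[tuple[int, int], int], bin_um: int, base_um: int = 2) -> Counter:
    """Sum counts of base_um squares into larger bin_um squares."""
    if bin_um % base_um != 0:
        raise ValueError("bin size must be a multiple of the base size")
    factor = bin_um // base_um  # 4 for 8 um bins: 4 x 4 = 16 squares per bin
    binned: Counter = Counter()
    for (row, col), n in counts_2um.items():
        binned[(row // factor, col // factor)] += n
    return binned

# Hypothetical transcript counts for one gene on the 2 um grid
counts = {(0, 0): 1, (3, 3): 2, (4, 0): 5, (7, 7): 1}
bins_8um = bin_counts(counts, 8)
bins_16um = bin_counts(counts, 16)
```

Larger bins trade spatial resolution for more counts per bin, which is the usual reason to analyze the binned data.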
○ Slide-seq and Slide-seq V2
○ Employs random spatial bead spreading and in situ sequencing decoding.
○ 97% of spots consist of one or two cell types.
○ HDST
○ Deposits beads with combinatorial barcodes on patterned wafers which are then decoded with serial hybridization.
○ NanoString GeoMx
○ NanoString lost a patent dispute with 10x Genomics as of Nov ‘23 (ref) → The bankruptcy of NanoString (ref)
○ Stereo-seq: Higher spatial resolution than Visium
○ Utilizes Illumina or MGI sequencing for oligo patterning on flow cells, and barcode calling is performed directly on the sequencer.
○ Diameter: 220 nm
○ Distance between spots: 500 or 715 nm
○ Seq-Scope: Higher spatial resolution than Visium
○ Utilizes Illumina or MGI sequencing for oligo patterning on flow cells, and barcode calling is performed directly on the sequencer.
○ PIXEL-seq
○ XYZeq
○ Tissue is placed on a spatially barcoded microwell array for an initial round of reverse transcription, after which whole cells are removed and undergo single-cell sequencing.
○ Tissue is placed on a glass slide bearing spatially gridded hashing oligos; tissue is then permeabilized to enable oligo transfer and then imaged; nuclei are then extracted, fixed and sequenced.
○ sci-RNA-seq
○ TIVA-seq
○ NICHE-seq
○ ZipSeq
○ It uses patterned illumination and photocaged oligonucleotides to serially print ‘zipcodes’ onto live cells in intact tissues in real time.
○ DBiT-seq
○ Delivers barcoded oligos directly to tissue through orthogonal microfluidics in a predetermined spatial distribution.
○ CITE-seq (ref1, ref2): Enables parallel comparison of spatial transcriptomics and antibody distribution
Figure 12. Diagram of CITE-seq
○ Connect the 5’ end of oligonucleotide to an antibody using streptavidin-biotin.
○ The oligonucleotide can hybridize complementarily with the oligo-dT primer.
○ Streptavidin-biotin bond can dissociate under reducing conditions.
○ Recently, perturb-CITE-seq was also developed.
○ SPOTS
○ Spatial PrOtein and Transcriptome Sequencing
○ Indirectly assess protein level on Visium using polyadenylated DNA-barcoded antibody
○ Open-ST
○ MAGIC-seq
④ 2-2. FISH based spatial transcriptomics: Few genes + many spots
○ ISS (in situ sequencing): A technique for sequencing RNA at its original location in tissue, based on sequencing by ligation
○ Type 1. The first ISS
○ Type 2. ISS with Padlock probe
○ Reverse transcriptase creates cDNA of the RNA target
○ Padlock probe can hybridize to two regions of the cDNA
○ Target sequence amplification occurs through RCA (rolling-circle amplification)
○ RCA product is sequenced in situ by ligation
○ Type 4. barcode based methods
○ Type 5. gap-filled ISS
○ smFISH (single molecule FISH) (2008)
○ seqFISH (sequential FISH) (2014): DNase I-based digestion and sequential staining and imaging rounds to decode transcripts
○ seqFISH+: Genome-scale transcriptome investigation separating individual transcripts into fluorescence spectra, employing 20 probes per encoding round.
○ Vizgen - MERSCOPE (Technology name: MERFISH (multiplexed error-robust FISH))
○ Direct probe hybridization without separate amplification mechanism.
○ Each FISH probe corresponds 1:1 with each gene (though this assumption may not always hold).
○ Employing error correction in barcode assignment for robust barcode calling in noisy FISH-based images.
○ Step 1. Photograph multiple times with fluorescence varying over time for each FISH probe
○ Step 2. Identify each gene by decoding the binary code read from each RNA
Figure 13. Principle of MERFISH
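The error-robust decoding idea can be sketched with a small codebook whose codewords differ from each other in at least 4 bits, so a single misread bit can be corrected and two misread bits can be detected. The codebook, gene names, and function names below are hypothetical and far smaller than a real panel:

```python
from typing import Optional

# Hypothetical 8-bit codebook; every pair of codewords differs in >= 4 bits.
CODEBOOK = {
    "GeneA": "11110000",
    "GeneB": "00001111",
    "GeneC": "11001100",
    "GeneD": "10101010",
}

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have equal length")
    return sum(x != y for x, y in zip(a, b))

def decode(observed: str, max_errors: int = 1) -> Optional[str]:
    """Return the nearest gene if it is within max_errors bits, else None."""
    gene, dist = min(
        ((g, hamming(observed, code)) for g, code in CODEBOOK.items()),
        key=lambda pair: pair[1],
    )
    return gene if dist <= max_errors else None

exact = decode("11110000")      # bits read over 8 imaging rounds
corrected = decode("11010000")  # one bit dropped out
rejected = decode("11000000")   # two bits off, ambiguous
```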
○ 10x - Xenium
○ Small amount of padlock probe + rolling circle amplification
○ Step 1. Padlock probe binds complementary RNA transcript in a pincer shape, forming a loop
○ Step 2. RCA (rolling circle amplification): The circularized probe is amplified after loop formation
○ Step 3. Hybridize each RNA transcript with a fluorescent probe, then perform fluorescent imaging → washing
○ Step 4. Repeat Step 3 and decode labels for each gene from the generated images
Figure 14. Principle of Xenium
○ Nanostring - CosMx
○ Small amount of probe + branched-chain hybridization
○ NanoString won in the U.S. against 10x for violating antitrust laws in July ‘23 (ref) → The bankruptcy of NanoString (ref)
Figure 15. Principle of CosMx
○ FISSEQ and oligoFISSEQ
○ Veranome
○ Rebus
○ BOLORAMIS
○ STARmap: Sequencing by ligation
○ SEDAL sequencing
○ ExSeq
○ BaristaSeq: Sequencing by synthesis
○ BARSeq and BARSeq2
○ HybISS
○ SABER
○ clampFISH
○ split-FISH
○ SCRINSHOT
○ PLISH
○ osmFISH
○ ExFISH
○ par-seqFISH
○ EASI-FISH
○ SGA
○ corrFISH
⑤ Type 3. Spatial proteomics: Broadly classified into mass spectrometry-based and imaging-based methods
○ SWITCH
○ MxIF
○ t-CyCIF
○ IBEX
○ DEI
○ CODEX
○ immuno-SABER
○ TSA
○ Opal IHC
○ MIBI
○ IMC
○ HD-MIBI
○ GeoMx Digital Spatial Profiler (DSP): 100 mm scale
○ GeoMx DSP stains tissues with suites of antibodies or gene probes fused to UV-cleavable DNA barcodes.
⒀ Other sequencing technologies
① TCR-seq (T cell receptor sequencing): Sequencing used to track T cell subtypes and clones.
② Invade-seq: A sequencing technique for analyzing the host-microbiome.
③ long-read sequencing: 2022 Technology of the Year (Reference)
○ Fewer sequencing gaps compared to short-read sequencing
Figure 16. long-read sequencing and short-read sequencing
○ Advantage 1. AS analysis (alternative splicing analysis): Can identify alternative splicing events, isoforms, etc.
○ Advantage 2. Easier integration of epigenetics and transcriptomics
○ Example 1. Pacific Biosciences SMRT (single molecule real-time) sequencing: Average read length is ~20 kb
○ Example 2. Oxford Nanopore Sequencing: Average read length is ~100 kb
④ non-invasive sequencing
○ A technology that allows sequencing without breaking cells
⑤ Halo-seq: A technique for obtaining the transcriptome of RNAs adjacent to a specific protein.
○ Step 1. Attach a HaloTag domain to a specific target.
○ Step 2. This HaloTag generates an alkyne handle radical by ejecting a hydrogen radical H· from a radical-producing Halo ligand injected with an alkyne handle.
○ R-H → R· + H·
○ Step 3. Similarly, the HaloTag generates an RNA radical by ejecting a hydrogen radical H· from RNA.
○ RNA-H → RNA· + H·
○ Step 4. The alkyne handle radical combines with the RNA radical.
○ Step 5. React alkyne-RNA with biotin azide to produce biotinylated RNA.
○ Step 6. Separate only the biotinylated RNA using affinity chromatography with streptavidin.
○ Step 7. RNA-seq allows for the detection of RNAs close to the specific target.
○ Reason: Radicals are unstable and cannot travel long distances.
Figure 17. Principle of Halo-seq
⑥ multi-NTT seq (nanobody tethered transposition followed by sequencing)
⑦ Epigenomics sequencing
⑨ Temporal Sequencing
○ Live-seq
○ TMI
⑩ Spatiotemporal Omics
○ ORBIT (single-molecule DNA origami rotation measurement)
○ 4D spatiotemporal MRI or hyperpolarized MR
○ in vivo 4D omics with transparent mice
⒁ NGS (next-generation sequencing) Summary
① Cost of genome analysis
○ 2001: Human Genome Project benchmark $100 million / person
○ 2007: 100 billion won / 4 years
○ 2008: 454 Life Sciences standard $1,000,000 / person. 1.5 billion won / 4.5 months
○ 2009: Helicos BioSciences standard $48,000 / person
○ The cost was predicted to fall to about one million won by 2014 (Nature 456, 23-25, 2008)
② Scale of genome analysis
Figure 18. Trend of genome analysis scale
③ Relationship between depth and coverage
○ sequencing depth (read depth): Indicates how many times a specific nucleotide is read on average
Figure 19. Definition of depth
○ “10x” means it was read 10 times
○ Can be defined for each nucleotide
○ coverage (c)
○ c := LN / G
○ L: read length
○ N: number of reads
○ G: haploid genome length
○ Sequencing depth represents total read number
○ Coverage represents the relationship between sequence reads and a reference (e.g. whole genome, a locus)
○ Otherwise, depth and coverage are very similar concepts
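The coverage formula c = LN / G can be used in both directions: to compute coverage from a run, or to estimate how many reads a target coverage requires. A minimal sketch, assuming a round-number haploid human genome length of 3 billion bases and 150-base reads, both chosen only for illustration:

```python
import math

def coverage(read_length: int, n_reads: int, genome_length: int) -> float:
    """c = L * N / G."""
    return read_length * n_reads / genome_length

def reads_needed(target_coverage: float, read_length: int, genome_length: int) -> int:
    """Smallest N such that L * N / G >= target coverage."""
    return math.ceil(target_coverage * genome_length / read_length)

G = 3_000_000_000  # assumed haploid genome length (round number)
L = 150            # assumed read length in bases
c = coverage(L, 600_000_000, G)
n = reads_needed(30, L, G)
```

Under these assumptions, 600 million reads give 30X, which matches the depth quoted for WGS above.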
④ Relationship between bulk and read
○ bulk: total RNA production
○ At equal depth, the read count per transcript is inversely proportional to bulk, which distorts comparisons between samples
○ Example: In spatial transcriptomics, bulk is typically large and depth is low, resulting in low RNA read count
○ Normalization: Various methods have been introduced to correct this distortion
⑤ Relationship between read length and number of reads
○ If read length is less than 250 bp, it is impossible to detect sequence error
○ Relationship between read length and number of reads per run: there is a trade-off
Figure 20. Relationship between read length and number of reads per run
⑥ Relationship between transcriptome read count and gene expression
○ read count: Number of reads actually observed for each transcript
○ gene expression: Value corrected from read count through normalization process
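As one concrete example of such a normalization, counts per million (CPM) rescales each sample by its total read count. CPM is chosen here only for illustration; it is one of several schemes and the text above does not single it out. The gene names and counts are invented:

```python
def counts_per_million(counts: dict[str, int]) -> dict[str, float]:
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("library has no reads")
    return {gene: 1_000_000 * n / total for gene, n in counts.items()}

# Two hypothetical samples with the same composition but 10x different depth
shallow = {"g1": 100, "g2": 900}
deep = {"g1": 1000, "g2": 9000}
cpm_shallow = counts_per_million(shallow)
cpm_deep = counts_per_million(deep)
```

The raw counts differ tenfold, but the normalized values agree, which is the sense in which gene expression is a corrected version of read count.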
Entered: 2015.07.02 23:31
Updated: 2022.03.13 13:11