Bioinformatics Appendix
1. Bioinformatics Consortium
⑴ Human Genome Project (HGP)
① Started in 1990 under the leadership of James Watson and, later, Francis Collins: launched as a 15-year project by a consortium of six countries
② Collaborative research by over 350 research institutions
○ June 2000: 84.5% completed, draft published
○ April 14, 2003: Completion announced; final version published with 99.99% accuracy
○ Over 2,800 researchers participated over 13 years, costing $2.7 billion
③ Ripple effects of human genome research
○ Birth of bioinformatics
○ Accelerated the development of human protein production processes
○ Insulin: The first protein to have its sequence determined
○ Promoted the development of automated sequencing equipment
○ Stimulated genome analysis of other biologically significant organisms
④ Scientists vs Entrepreneurs
○ Scientists: The Human Genome Project (HGP). James Watson and Francis Collins. Final cost: $2.7 billion
○ Entrepreneurs: Celera Genomics (started in 1998). Craig Venter. Final cost: $300 million
○ Both groups published their human genome drafts in 2001
○ Scientists were dissatisfied with entrepreneurs taking credit for their work and investments
○ Entrepreneurs were frustrated with scientists for withholding information → Led to new methodologies
○ Final completion of the genome map was acknowledged as a joint achievement
⑵ ENCODE Project
① Timeline: Draft in 2001 → In 2003, NHGRI launched the ENCODE project to identify all functional elements in the human genome
② Phase I: 1% of genome. Completed in 2007
③ Phase II: Build-out phase. Completed in 2012
○ GENCODE version 7 (December 2010)
○ 51,082 genes: 161,375 transcripts
○ 20,687 protein-coding genes: 76,052 transcripts
○ 9,640 lncRNAs: 15,512 transcripts
④ Phase III: Production phase. Completed in 2016
⑤ Phase IV: Started in 2016-2017
○ GENCODE version 29 (May 2018)
○ 58,721 genes: 206,694 transcripts
○ 19,940 protein-coding genes: 83,129 transcripts
○ 16,066 lncRNAs: 29,566 transcripts
○ GENCODE version 36 (May 2020)
○ 60,660 genes: 232,117 transcripts
○ 19,962 protein-coding genes: 85,269 transcripts
○ 17,958 lncRNAs: 48,734 transcripts
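The gene and transcript counts listed above can be turned into an average number of transcripts per gene, which makes the growth in annotated isoforms easier to see than the raw totals. The following is a minimal sketch, not part of the original notes: the dictionary layout, the short version keys, and the helper name `transcripts_per_gene` are illustrative assumptions, and the numbers are copied as-is from the three version summaries.

```python
# (genes, transcripts) per category, copied from the version summaries above.
# The keys "v7", "v29", "v36" are shorthand chosen for this sketch.
annotation_counts = {
    "v7": {
        "all": (51_082, 161_375),
        "protein-coding": (20_687, 76_052),
        "lncRNA": (9_640, 15_512),
    },
    "v29": {
        "all": (58_721, 206_694),
        "protein-coding": (19_940, 83_129),
        "lncRNA": (16_066, 29_566),
    },
    "v36": {
        "all": (60_660, 232_117),
        "protein-coding": (19_962, 85_269),
        "lncRNA": (17_958, 48_734),
    },
}


def transcripts_per_gene(genes: int, transcripts: int) -> float:
    """Average number of annotated transcripts per gene."""
    return transcripts / genes


# Ratio for every version and category, rounded to two decimals.
ratios = {
    version: {
        category: round(transcripts_per_gene(genes, transcripts), 2)
        for category, (genes, transcripts) in categories.items()
    }
    for version, categories in annotation_counts.items()
}

for version, by_category in ratios.items():
    print(version, by_category)
```

With these figures, the lncRNA ratio rises from about 1.61 transcripts per gene in version 7 to about 2.71 in version 36, while the protein-coding ratio moves from about 3.68 to about 4.27.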
⑶ 1000 Genomes Project
① Sequenced the whole genomes of 2,504 individuals from 26 global populations, identifying over 88 million genetic variants
② A typical genome differs from the reference human genome at 4.09 million to 5.02 million sites, affecting approximately 20 million bases
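To put the per-genome figures in proportion, they can be divided by the length of the reference genome. The sketch below is only a rough illustration: the reference length of about 3.1 billion bases is an assumption that does not come from these notes, and the constant and function names are hypothetical.

```python
# Assumption (not stated in the notes): haploid reference length of ~3.1 billion bases.
ASSUMED_REFERENCE_LENGTH_BP = 3_100_000_000

# Figures from the text above.
VARIANT_SITES_RANGE = (4_090_000, 5_020_000)  # sites differing from the reference
AFFECTED_BASES = 20_000_000                   # approximate bases affected


def fraction_of_reference(n_bases: int,
                          reference_length: int = ASSUMED_REFERENCE_LENGTH_BP) -> float:
    """Share of the assumed reference length covered by n_bases."""
    return n_bases / reference_length


site_fraction_range = tuple(fraction_of_reference(n) for n in VARIANT_SITES_RANGE)
affected_fraction = fraction_of_reference(AFFECTED_BASES)

print(f"variant sites: {site_fraction_range[0]:.3%} to {site_fraction_range[1]:.3%}")
print(f"affected bases: {affected_fraction:.3%}")
```

Under that assumed length, the variant sites correspond to roughly 0.13% to 0.16% of positions, and the affected bases to roughly 0.65% of the reference.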
⑷ GTEx Consortium
⑸ 4D Nucleome Consortium
⑹ Pan-genome Consortium: T2T (telomere-to-telomere)
⑺ Cellxgene Census
2. Data Growth Rate
data phase | astronomy | Twitter | YouTube | genomics |
---|---|---|---|---|
acquisition | 25 zetta-bytes/yr | 0.5-15 billion tweets/yr | 500-900 million hrs/yr | 1 zetta-bases/yr |
storage | 1 EB/yr | 1-17 PB/yr | 1-2 EB/yr | 2-40 EB/yr |
analysis | in situ data reduction; real-time processing; massive volumes | topic and sentiment mining; metadata analysis | limited requirements | heterogeneous data and analysis; variant calling, ~2 trillion central processing unit (CPU) hours |
distribution | dedicated lines from antennae to server (600 TB/s) | small units of distribution | major component of modern user's bandwidth (10 MB/s) | many small (10 MB/s) and fewer massive (10 TB/s) data movement |
Table 1. Data Growth Rate (ref)
3. Sequencing Technology Throughput
Platform | Sequencer model | Read length | Reads per run |
---|---|---|---|
Illumina | iSeq 100 | 75-300 bp | 4 million |
Illumina | MiniSeq | 75-300 bp | 25 million |
Illumina | MiSeq | 75-300 bp | 25 million |
Illumina | NextSeq 550 | 75-150 bp | 400 million |
Illumina | NovaSeq 6000 | 75-300 bp | 10 billion |
PacBio | Sequel | 10-60 kb | 1 million |
PacBio | Sequel II | 10-100 kb | 7 million |
PacBio | Sequel IIe | 10-100 kb | 8 million |
Oxford Nanopore | MinION | 10 kb - 1 Mb | 1 million |
Oxford Nanopore | GridION | 10 kb - 1 Mb | 5 million |
Oxford Nanopore | PromethION 24 | 10 kb - 1 Mb | 15 million |
Oxford Nanopore | PromethION 48 | 10 kb - 1 Mb | 30 million |
Table 2. Sequencing Technology Throughput
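Multiplying read length by reads per run gives a crude bound on the number of bases produced per run. The sketch below applies that arithmetic to a few rows of Table 2. It is an illustration under simplifying assumptions, not a vendor specification: it assumes every read has exactly the minimum or the maximum listed length and ignores paired-end layouts, quality filtering, and real read-length distributions. The tuple layout and the name `yield_bounds_gb` are hypothetical.

```python
# Selected rows from Table 2:
# (platform, model, min read length in bp, max read length in bp, reads per run)
sequencers = [
    ("Illumina", "iSeq 100", 75, 300, 4_000_000),
    ("Illumina", "NextSeq 550", 75, 150, 400_000_000),
    ("Illumina", "NovaSeq 6000", 75, 300, 10_000_000_000),
    ("PacBio", "Sequel", 10_000, 60_000, 1_000_000),
    ("PacBio", "Sequel II", 10_000, 100_000, 7_000_000),
    ("Oxford Nanopore", "MinION", 10_000, 1_000_000, 1_000_000),
]


def yield_gb(read_length_bp: int, reads_per_run: int) -> float:
    """Bases per run in gigabases, assuming all reads share one length."""
    return read_length_bp * reads_per_run / 1e9


# Lower and upper bound per model, taken from the listed read-length range.
yield_bounds_gb = {
    model: (yield_gb(min_bp, reads), yield_gb(max_bp, reads))
    for _platform, model, min_bp, max_bp, reads in sequencers
}

for model, (low, high) in yield_bounds_gb.items():
    print(f"{model}: {low:g} to {high:g} Gb per run (crude bounds)")
```

The upper bounds for long-read platforms should be read as a loose ceiling, since only a small share of reads reach the maximum listed length.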
○ Sanger dideoxy (capillary electrophoresis): 700-800 bp read. Very high accuracy
○ Pyrosequencing: ~400 bp / read
○ Illumina: ~100 bp / read (recently up to 250 bp)
Input: 2022.02.21 12:51
Modified: 2024.10.24 22:06