
Bioinformatics Appendix

Recommendation: 【Bioinformatics】 Bioinformatics Analysis Table of Contents


1. Bioinformatics Consortium

2. Data Growth Rate

3. Sequencing Technology Throughput



1. Bioinformatics Consortium

Genome Project

① Started in 1990 under the leadership of James Watson and Francis Collins: launched as a 15-year project by a consortium of six countries

② Collaborative research by over 350 research institutions

○ June 2000: 84.5% completed, draft published

○ April 15, 2003: Final version published with 99.99% accuracy

○ Over 2,800 researchers participated over 13 years, costing $2.7 billion

③ Spin-off effects of human genome research

○ Birth of bioinformatics

○ Accelerated the development of human protein production processes

○ Insulin: the first protein whose amino acid sequence was determined

○ Promoted the development of automated sequencing equipment

○ Stimulated genome analysis of other biologically significant organisms

④ Scientists vs Entrepreneurs

○ Scientists: The Human Genome Project (HGP). James Watson and Francis Collins. Final cost: $2.7 billion

○ Entrepreneurs: Celera Genomics (started in 1998). Craig Venter. Final cost: $300 million

○ Both groups published their human genome drafts in 2001

○ Scientists were dissatisfied with entrepreneurs taking credit for their work and investments

○ Entrepreneurs were frustrated with scientists for withholding information → Led to new methodologies

○ Final completion of the genome map was acknowledged as a joint achievement

ENCODE Project

① Timeline: Draft in 2001 → In 2003, NHGRI launched the ENCODE project to identify all functional elements in the human genome

② Phase I: 1% of genome. Completed in 2007

③ Phase II: Build-out phase. Completed in 2012

○ GENCODE annotation, version 7 (December 2010)

○ 51,082 genes: 161,375 transcripts

○ 20,687 protein-coding genes: 76,052 transcripts

○ 9,640 lncRNAs: 15,512 transcripts

④ Phase III: Production phase. Completed in 2016

⑤ Phase IV: Started in 2016-2017

○ GENCODE annotation, version 29 (May 2018)

○ 58,721 genes: 206,694 transcripts

○ 19,940 protein-coding genes: 83,129 transcripts

○ 16,066 lncRNAs: 29,566 transcripts

○ GENCODE annotation, version 36 (May 2020)

○ 60,660 genes: 232,117 transcripts

○ 19,962 protein-coding genes: 85,269 transcripts

○ 17,958 lncRNAs: 48,734 transcripts
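The counts in the three annotation versions listed above are easier to compare as transcripts per gene. The following is a minimal sketch of that arithmetic using only the figures quoted in the list; the dictionary layout and the function name `transcripts_per_gene` are illustrative choices, not part of any annotation tool.

```python
# Gene and transcript counts copied from the three versions listed above.
# Each entry is (number of genes, number of transcripts).
annotation_versions = {
    "Version 7 (December 2010)": {
        "all genes": (51_082, 161_375),
        "protein-coding": (20_687, 76_052),
        "lncRNA": (9_640, 15_512),
    },
    "Version 29 (May 2018)": {
        "all genes": (58_721, 206_694),
        "protein-coding": (19_940, 83_129),
        "lncRNA": (16_066, 29_566),
    },
    "Version 36 (May 2020)": {
        "all genes": (60_660, 232_117),
        "protein-coding": (19_962, 85_269),
        "lncRNA": (17_958, 48_734),
    },
}


def transcripts_per_gene(genes: int, transcripts: int) -> float:
    """Average number of annotated transcripts per gene (hypothetical helper)."""
    return transcripts / genes


# Ratio for every category in every version.
ratios = {
    version: {
        category: transcripts_per_gene(genes, transcripts)
        for category, (genes, transcripts) in categories.items()
    }
    for version, categories in annotation_versions.items()
}

for version, by_category in ratios.items():
    summary = ", ".join(f"{cat}: {value:.2f}" for cat, value in by_category.items())
    print(f"{version} -> {summary}")
```

With these figures the ratio rises from one version to the next in every category, which suggests that annotation growth has come mainly from newly described transcripts rather than from newly described protein-coding genes.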

1000 Genomes Project

① Sequenced the whole genomes of 2,504 individuals from 26 global populations, identifying over 88 million genetic variations

② A typical genome differs from the reference human genome at 4.09 million to 5.02 million sites, affecting approximately 20 million bases
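To put these numbers in proportion, the sketch below works out what share of the reference the roughly 20 million affected bases represent, and how many bases are affected per variant site on average. The reference length of about 3.1 billion bases is an assumption added here for illustration; it is not stated in the text above.

```python
# Figures quoted in the text above.
variant_sites_low = 4_090_000
variant_sites_high = 5_020_000
affected_bases = 20_000_000

# Assumption (not from the text): approximate length of the human reference genome.
ASSUMED_REFERENCE_LENGTH = 3_100_000_000

# Share of the reference that differs in a typical genome.
affected_fraction = affected_bases / ASSUMED_REFERENCE_LENGTH

# Average number of bases affected per variant site, at both ends of the range.
bases_per_site_low = affected_bases / variant_sites_high
bases_per_site_high = affected_bases / variant_sites_low

print(f"Share of reference bases affected: {affected_fraction:.2%}")
print(
    "Average bases affected per variant site: "
    f"{bases_per_site_low:.1f} to {bases_per_site_high:.1f}"
)
```

An average of more than one base per site is what one would expect if some of the sites are insertions, deletions, or larger structural variants rather than single-base changes.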

GTEx Consortium

4D Nucleome Consortium

Pan-genome Consortium: T2T (telomere-to-telomere)

Cellxgene Census



2. Data Growth Rate


| Data phase | Astronomy | Twitter | YouTube | Genomics |
|---|---|---|---|---|
| Acquisition | 25 zettabytes/yr | 0.5-15 billion tweets/yr | 500-900 million hours/yr | 1 zettabase/yr |
| Storage | 1 EB/yr | 1-17 PB/yr | 1-2 EB/yr | 2-40 EB/yr |
| Analysis | In situ data reduction; real-time processing; massive volumes | Topic and sentiment mining; metadata analysis | Limited requirements | Heterogeneous data and analysis; variant calling, ~2 trillion CPU hours |
| Distribution | Dedicated lines from antennae to server (600 TB/s) | Small units of distribution | Major component of modern user's bandwidth (10 MB/s) | Many small (10 MB/s) and fewer massive (10 TB/s) data movements |

Table 1. Data Growth Rate (ref)
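The storage row of Table 1 mixes petabytes and exabytes, which makes the domains hard to compare at a glance. The sketch below converts the quoted ranges to a single unit, assuming decimal prefixes (1 EB = 1,000 PB); the names `UNIT_IN_PB` and `to_petabytes` are hypothetical helpers introduced only for this illustration.

```python
# Assumption: decimal (SI) prefixes, so 1 EB = 1,000 PB.
UNIT_IN_PB = {"PB": 1, "EB": 1_000}

# Storage ranges per year from Table 1: (low, high, unit).
storage_per_year = {
    "Astronomy": (1, 1, "EB"),
    "Twitter": (1, 17, "PB"),
    "YouTube": (1, 2, "EB"),
    "Genomics": (2, 40, "EB"),
}


def to_petabytes(value: float, unit: str) -> float:
    """Convert a storage figure to petabytes (hypothetical helper)."""
    return value * UNIT_IN_PB[unit]


# Normalise every range to petabytes per year.
storage_in_pb = {
    domain: (to_petabytes(low, unit), to_petabytes(high, unit))
    for domain, (low, high, unit) in storage_per_year.items()
}

# Domain with the largest upper-bound storage need.
largest_domain = max(storage_in_pb, key=lambda domain: storage_in_pb[domain][1])

for domain, (low, high) in storage_in_pb.items():
    print(f"{domain}: {low:,.0f} to {high:,.0f} PB/yr")
print(f"Largest upper bound: {largest_domain}")
```

On a common scale, the upper bound for genomics is the largest of the four domains in the table, and the Twitter range is smaller than the others by orders of magnitude.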



3. Sequencing Technology Throughput


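| Platform | Sequencer model | Read length | Reads per run |
|---|---|---|---|
| Illumina | iSeq 100 | 75-300 bp | 4 million |
| Illumina | MiniSeq | 75-300 bp | 25 million |
| Illumina | MiSeq | 75-300 bp | 25 million |
| Illumina | NextSeq 550 | 75-150 bp | 400 million |
| Illumina | NovaSeq 6000 | 75-300 bp | 10 billion |
| PacBio | Sequel | 10-60 kb | 1 million |
| PacBio | Sequel II | 10-100 kb | 7 million |
| PacBio | Sequel IIe | 10-100 kb | 8 million |
| Oxford Nanopore | MinION | 10 kb - 1 Mb | 1 million |
| Oxford Nanopore | GridION | 10 kb - 1 Mb | 5 million |
| Oxford Nanopore | PromethION 24 | 10 kb - 1 Mb | 15 million |
| Oxford Nanopore | PromethION 48 | 10 kb - 1 Mb | 30 million |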

Table 2. Sequencing Technology Throughput
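A rough sense of output per run can be obtained by multiplying read length by reads per run. The sketch below does this for every row of Table 2, under the simplifying assumption that all reads in a run have either the minimum or the maximum listed length. Real runs fall somewhere in between, and for the long-read platforms most reads are far shorter than the listed maximum, so the upper bounds should be read as ceilings rather than typical yields. The function name `yield_range_gb` is a hypothetical helper.

```python
# Rows of Table 2: (platform, model, min read length in bp,
# max read length in bp, reads per run).
sequencers = [
    ("Illumina", "iSeq 100", 75, 300, 4_000_000),
    ("Illumina", "MiniSeq", 75, 300, 25_000_000),
    ("Illumina", "MiSeq", 75, 300, 25_000_000),
    ("Illumina", "NextSeq 550", 75, 150, 400_000_000),
    ("Illumina", "NovaSeq 6000", 75, 300, 10_000_000_000),
    ("PacBio", "Sequel", 10_000, 60_000, 1_000_000),
    ("PacBio", "Sequel II", 10_000, 100_000, 7_000_000),
    ("PacBio", "Sequel IIe", 10_000, 100_000, 8_000_000),
    ("Oxford Nanopore", "MinION", 10_000, 1_000_000, 1_000_000),
    ("Oxford Nanopore", "GridION", 10_000, 1_000_000, 5_000_000),
    ("Oxford Nanopore", "PromethION 24", 10_000, 1_000_000, 15_000_000),
    ("Oxford Nanopore", "PromethION 48", 10_000, 1_000_000, 30_000_000),
]


def yield_range_gb(min_len: int, max_len: int, reads: int) -> tuple[float, float]:
    """Lower and upper bound on bases per run, in gigabases (hypothetical helper).

    Assumes every read has the minimum (or maximum) listed length.
    """
    return (min_len * reads / 1e9, max_len * reads / 1e9)


# Yield bounds keyed by sequencer model.
yields = {
    model: yield_range_gb(min_len, max_len, reads)
    for _platform, model, min_len, max_len, reads in sequencers
}

for platform, model, *_ in sequencers:
    low, high = yields[model]
    print(f"{platform} {model}: {low:,.1f} to {high:,.1f} Gb per run")
```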


○ Sanger dideoxy (capillary electrophoresis): 700-800 bp read. Very high accuracy

○ Pyrosequencing: ~400 bp / read

○ Illumina: ~100 bp / read (recently up to 250 bp)



Input: 2022.02.21 12:51

Modified: 2024.10.24 22:06
