Chapter 11. Bioinformatics
1. Overview
2. Comparative Genomics
3. Functional Genomics
4. Epigenetics
5. Metagenomics
6. Transcriptomics
7. Proteomics
8. Metabolomics
9. Pharmacomics
10. Phenomics
11. Radiomics
1. Overview
⑴ Cancer types > 10²
⑵ Cancer patients per year ~ 2 × 10⁶
⑶ Transcription factors ~ 1,600
⑷ Driver mutations ~ 10⁵
⑸ Variant combinations ~ ₁₀₀₀₀₀C₆
⑹ Cell types and states ~ 10⁴
⑺ Gene combinations ~ 10¹³
⑻ Antibody sequences ~ 20³²
⑼ Small molecules ~ 10⁶⁰
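The scale of the "variant combinations" and "antibody sequences" entries can be checked with exact integer arithmetic. The sketch below assumes the former means choosing 6 variants out of 100,000 and the latter means 20 possible amino acids at each of 32 positions; both readings are assumptions about the list, and the helper name is illustrative.

```python
import math

# Assumption: "variant combinations" = number of ways to choose 6 of 100,000 variants.
variant_combinations = math.comb(100_000, 6)

# Assumption: "antibody sequences" = 20 amino acids at each of 32 positions.
antibody_sequences = 20 ** 32


def order_of_magnitude(n: int) -> int:
    """Return k such that 10**k <= n < 10**(k + 1) for a positive integer n."""
    return len(str(n)) - 1


print("variant combinations ~ 10^%d" % order_of_magnitude(variant_combinations))
print("antibody sequences   ~ 10^%d" % order_of_magnitude(antibody_sequences))
```

Under these assumptions both numbers are far beyond what exhaustive experiments could cover, which is the point of the list.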
2. Comparative Genomics
⑴ Features of the Human Genome
① The human genome is composed of 3.1 billion base pairs.
② Less than 1/3 of the genome is transcribed into RNA, and only about 5% of that transcribed portion encodes proteins.
③ Genes encoding proteins amount to around 20,000 to 25,000: Similar to other mammals.
④ Genes are on average about 3,000 bases long.
⑤ All humans are at least 99.9% identical.
⑥ The human genome contains a significant amount of repetitive sequences.
⑦ Less than 7% of protein-coding genes are specific to vertebrates.
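A few of the figures above can be combined into rough back-of-the-envelope estimates. The sketch takes the numbers in the list at face value (3.1 billion bp, 99.9% identity, about 3,000 bases per gene, 20,000 to 25,000 genes); the variable names are illustrative.

```python
GENOME_BP = 3.1e9                   # genome size from the list above
IDENTITY = 0.999                    # minimum identity between two humans
AVG_GENE_BP = 3_000                 # average gene length from the list above
GENE_COUNT_RANGE = (20_000, 25_000) # protein-coding gene count range

# Upper bound on positions at which two human genomes differ.
differing_sites = GENOME_BP * (1 - IDENTITY)

# Total sequence covered by genes, and its share of the genome.
gene_span_bp = tuple(n * AVG_GENE_BP for n in GENE_COUNT_RANGE)
gene_span_fraction = tuple(bp / GENOME_BP for bp in gene_span_bp)

print(f"differing sites: up to about {differing_sites:,.0f} bp")
print(f"gene span: {gene_span_bp[0]:,} to {gene_span_bp[1]:,} bp")
print(f"share of genome: {gene_span_fraction[0]:.1%} to {gene_span_fraction[1]:.1%}")
```

So even at 99.9% identity, two individuals can differ at roughly three million positions.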
⑵ Prokaryotic Genes vs Eukaryotic Genes
① Polycistronic mRNA vs Monocistronic mRNA (number of proteins encoded by one mRNA)
② Intron (×) vs Intron (O)
③ Simultaneity of transcription and translation (O) vs Simultaneity of transcription and translation (×)
④ mRNA processing (×) vs mRNA processing (O)
⑶ Comparison of Genome Size and Gene Number in Various Organisms
Type of Organism | Subcategory | Species | Genome Size (Mb; 1 Mb = 10⁶ bp) | Number of Protein-Coding Genes | Protein-Coding Sequence (%) |
---|---|---|---|---|---|
Prokaryotes | | Mycoplasma | 0.58 | 470 | 88 |
Prokaryotes | | E. coli | 4.64 | 4,300 | 88 |
Prokaryotes | | Bacillus subtilis | 4.20 | | |
Eukaryotes | Fungi | Yeast | 12.6 | 6,200 | 70 |
Eukaryotes | Fungi | Aspergillus | 25.4 | | |
Eukaryotes | Protozoa | Tetrahymena | 190 | | |
Eukaryotes | Invertebrates | C. elegans | 100 | 21,000 | 25 |
Eukaryotes | Invertebrates | Drosophila | 180 | 15,000 | 13 |
Eukaryotes | Invertebrates | Silkworm | 490 | | |
Eukaryotes | Invertebrates | Sea Urchin | 845 | | |
Eukaryotes | Vertebrates | Pufferfish | 400 | | |
Eukaryotes | Vertebrates | Human | 3,000 | ~23,500 | 1.5 |
Eukaryotes | Vertebrates | Mouse | 3,300 | | |
Eukaryotes | Plants | Arabidopsis | 125 | 26,000 | 25 |
Eukaryotes | Plants | Rice | 440 | 35,000 ~ 50,000 | 10 |
Eukaryotes | Plants | Pea | 4,800 | | |
Eukaryotes | Plants | Corn | 5,000 | | |
Eukaryotes | Plants | Wheat | 17,000 | | |
Table 1. Genome Sizes of Various Organisms
① Minimum gene count for life maintenance: Among 470 genes in M. genitalium, 337 are essential.
② Weak correlation between genome size and organism complexity.
③ Plant genomes are large due to frequent polyploidization.
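The weak link between genome size and gene number can be made concrete by computing gene density (genes per Mb) for the species in Table 1 that have both values. This is a minimal sketch using only the table's numbers; the dictionary and function names are illustrative.

```python
# (genome size in Mb, number of protein-coding genes), copied from Table 1.
TABLE1 = {
    "Mycoplasma": (0.58, 470),
    "E. coli": (4.64, 4_300),
    "Yeast": (12.6, 6_200),
    "C. elegans": (100, 21_000),
    "Drosophila": (180, 15_000),
    "Arabidopsis": (125, 26_000),
    "Human": (3_000, 23_500),
}


def genes_per_mb(size_mb: float, genes: int) -> float:
    """Gene density: protein-coding genes per megabase of genome."""
    return genes / size_mb


density = {species: genes_per_mb(*row) for species, row in TABLE1.items()}

# Print from the most gene-dense genome to the least.
for species, d in sorted(density.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{species:12s} {d:8.1f} genes/Mb")
```

Gene density falls by about two orders of magnitude from bacteria to humans, while the gene count itself changes far less than the genome size.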
⑷ Comparison of Single-Celled Prokaryotic and Eukaryotic Genomes
Category | E. coli | Yeast |
---|---|---|
Genome Size (Base Pairs) | 4,640,000 | 12,068,000 |
Number of Protein-Coding Genes | 4,300 | 6,200 |
Metabolism | 650 | 650 |
Energy Production and Storage | 240 | 175 |
Membrane Transporters | 280 | 250 |
DNA Replication, Repair, and Recombination | 120 | 175 |
Transcription | 230 | 400 |
Translation | 180 | 350 |
Protein Delivery and Secretion | 35 | 430 |
Cell Structure | 180 | 250 |
Table 2. Comparison between E. coli and Yeast
⑸ Essential genes required for multicellular organism characteristics (e.g., Caenorhabditis elegans).
Function | Protein Domain | Genes |
---|---|---|
Transcription Regulation | Zinc Finger; Homeobox | 540 |
RNA Processing | RNA Binding Domain | 100 |
Action Potential Transmission | Gated Ion Channel | 80 |
Tissue Formation | Collagen | 170 |
Cell Interaction | Extracellular Domain; Glycosyltransferase | 330 |
Cell-Cell Signaling | G Protein-Coupled Receptor, Protein Kinase, Protein Phosphatase | 1,290 |
Table 3. Gene families associated with multicellularity in Caenorhabditis elegans (C. elegans)
⑹ Comparison of human and mouse genomes
① Humans and mice have approximately a 50% difference in their nucleotide sequences and diverged around 75 million years ago.
② There is no significant difference in the genome size or the number of genes they possess; only the distribution of transposons, a type of repetitive sequence element, differs.
③ Genome composition: Approximately 180 fragmentation and recombination events have occurred, with over 90% of the genome moving as blocks (conserved synteny).
⑺ Comparison between human and chimpanzee
① The nucleotide sequence divergence between humans and chimpanzees is only about 1.23%.
⑻ Comparison of mitochondria and chloroplasts
① Mitochondrial genomics: 16,569 bp. 37 genes.
○ Many mitochondrial proteins are derived from the nucleus.
○ Example: β-oxidation and TCA cycle enzymes are transported from the cytoplasm.
○ Some proteins are transcribed and translated from mitochondrial DNA.
○ Example: Electron transport chain proteins and ATP synthase are synthesized independently.
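○ Termination codon: Differs from the standard genetic code. In human mitochondria, AGA and AGG act as stop codons, while UGA encodes tryptophan instead of stop.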
② Chloroplast genomics
○ Enzymes for Calvin cycle are synthesized independently.
○ The large subunit of Rubisco is produced in the chloroplast, while the small subunit is produced in the cytoplasm.
○ Not only β-oxidation and TCA cycle enzymes but also electron transport chain proteins and ATP synthase are transported from the cytoplasm.
③ Chloroplast genome is much larger than mitochondrial genome.
○ Mitochondria: Repetitive sequences, no introns.
○ Chloroplast: Repetitive sequences, many introns.
○ Most mitochondrial genes have moved to the nucleus.
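Mitochondria read a few codons differently from the nuclear genetic code, which is relevant to the termination codon entry above. The sketch below compares NCBI translation table 1 (standard) with table 2 (vertebrate mitochondrial); the amino-acid strings are typed in from those tables in TCAG codon order, and the variable names are illustrative.

```python
from itertools import product

BASES = "TCAG"  # NCBI codon ordering: TTT, TTC, TTA, TTG, TCT, ...

# One letter per codon; "*" marks a stop codon.
STANDARD = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
VERT_MITO = (
    "FFLLSSSSYY**CCWW"
    "LLLLPPPPHHQQRRRR"
    "IIMMTTTTNNKKSS**"
    "VVVVAAAADDEEGGGG"
)

CODONS = ["".join(c) for c in product(BASES, repeat=3)]

# Codons whose meaning differs between the two tables: (standard, mitochondrial).
differences = {
    codon: (std, mito)
    for codon, std, mito in zip(CODONS, STANDARD, VERT_MITO)
    if std != mito
}

for codon, (std, mito) in differences.items():
    print(f"{codon}: standard={std} mitochondrial={mito}")
```

Only four codons differ, but two of them change where translation stops, so mitochondrial genes must be translated with the mitochondrial table.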
3. Functional Genomics
⑴ Overview
① Definition: The study of all functions of DNA, including introns and regulatory elements.
② Utilizes technologies such as WGS, WES, GWAS, and ChIP-seq.
⑵ Movement of Genetic Material
① Virus
③ Mobile DNA: Transposons, Retrotransposons, LINE, SINE
⑶ Intermediate-frequency Repeat Sequences
① VNTR (Variable Number Tandem Repeats, relatively long), STR (Short Tandem Repeats, relatively short), Telomeres.
② Genetic anticipation: Repeat sequences expand over successive generations, increasing the likelihood of disease (e.g., Huntington’s disease).
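Repeat expansion is usually quantified by counting consecutive copies of the repeat unit. The sketch below counts the longest uninterrupted run of a unit such as CAG (the repeat expanded in Huntington's disease); the function name and the toy sequences are hypothetical, and no clinical threshold is implied.

```python
import re


def longest_repeat_run(seq: str, unit: str = "CAG") -> int:
    """Return the largest number of back-to-back copies of `unit` in `seq`."""
    pattern = re.compile(f"(?:{re.escape(unit.upper())})+")
    runs = [len(m.group()) // len(unit) for m in pattern.finditer(seq.upper())]
    return max(runs, default=0)


# Toy alleles: the "child" allele carries a longer run than the "parent" allele.
parent = "ATG" + "CAG" * 5 + "CCGCCA"
child = "ATG" + "CAG" * 9 + "CCGCCA"

print(longest_repeat_run(parent), longest_repeat_run(child))
```

Comparing run lengths across generations is one simple way to see the expansion that genetic anticipation describes.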
⑷ High-frequency Repeat Sequences
① Highly condensed.
② Centromere, satellite.
⑸ Satellite DNA
① A-T rich repetitive DNA.
② Low buoyant density.
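The two properties above are linked: in CsCl gradients, buoyant density rises with GC content, so A-T rich DNA bands at a lower density. The sketch uses the commonly cited empirical relation ρ ≈ 1.660 + 0.098 × (GC fraction) g/cm³; treat the constants as an approximation, and the function names and example sequences as illustrative.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence (0.0 to 1.0)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(base in "GC" for base in seq) / len(seq)


def buoyant_density(gc: float) -> float:
    """Approximate CsCl buoyant density in g/cm^3 (assumed empirical relation)."""
    return 1.660 + 0.098 * gc


at_rich_satellite = "AATAT" * 10   # toy A-T rich repeat
bulk_like = "ACGTTGCA" * 6         # toy sequence with 50% GC

for name, seq in [("satellite", at_rich_satellite), ("bulk", bulk_like)]:
    gc = gc_fraction(seq)
    print(f"{name:9s} GC={gc:.2f} density~{buoyant_density(gc):.3f} g/cm^3")
```

Because its density differs from that of bulk DNA, such repetitive DNA separates as a distinct "satellite" band.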
⑹ Multigene Families
① Homologous gene families (e.g., rRNA)
② Paralogous gene families (e.g., hemoglobin)
⑺ Single Nucleotide Polymorphism (SNP)
⑻ Copy Number Variation (CNV)
⑼ Loss of Heterozygosity (LOH)
⑽ Genomic Rearrangement
⑾ Rare Variant
4. Epigenetics
⑴ Overview
① Loop formation: Can occur when inverted repeat sequences are present on coding DNA.
② Intrinsic transcription terminators, tRNA, telomeric G-quadruplexes (tetra-G), etc. are examples of structures that involve loop formation.
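An inverted repeat can form a loop because its two arms are reverse complements of each other and can base-pair into a stem. The sketch below scans a sequence for such arm pairs separated by a short spacer; the function names, the default stem and loop lengths, and the example sequence are all hypothetical choices.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_inverted_repeats(seq, stem=4, min_loop=3, max_loop=8):
    """Return (left_start, right_start, left_arm, loop_length) for each
    pair of arms of length `stem` that could pair into a hairpin stem."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2 * stem - min_loop + 1):
        left = seq[i:i + stem]
        target = reverse_complement(left)
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop          # start of the candidate right arm
            if j + stem > len(seq):
                break
            if seq[j:j + stem] == target:
                hits.append((i, j, left, loop))
    return hits


# ACGG ... CCGT are reverse complements, separated by a 5-base spacer.
print(find_inverted_repeats("ACGGTTTTTCCGT"))
```

Real structure prediction also weighs base-pairing energies; this exact-match scan only flags candidate sites.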
⑵ Subfields
① BS-seq (bisulfite sequencing)
② ChIP-seq (chromatin immunoprecipitation sequencing)
③ Hi-C sequencing (high throughput chromatin conformation capture sequencing)
④ ATAC-seq (bulk & single cell)
⑤ NOMe-seq
5. Metagenomics
⑴ Definition: Collection of all microbial genomes present in a given environment.
⑵ Also referred to as metagenome, community genomics, and pangenomics.
6. Transcriptomics
⑴ Definition
① Study of the functions of transcribed RNA.
② Uses RNA, which can be detected with significantly higher sensitivity than proteins.
⑵ Subfields
① Bulk transcriptomics (bulk RNA-seq)
② Single-cell transcriptomics (single cell RNA-seq): Method of the year in 2013.
③ Spatial transcriptomics (spatial RNA-seq): Method of the year in 2020.
④ Structural transcriptomics: Related to epigenetics.
⑤ Alternative splicing and isoform analysis: Enabled by long-read sequencing, the Method of the Year in 2022.
⑥ RNA interference: miRNA, siRNA, etc.
⑦ Long non-coding RNA
⑧ Small RNA
⑨ Pseudogene: Transcribed but untranslated gene.
○ Type 1: Cases where a gene was copied through retrotransposition, so the copy lost its introns and promoter (processed pseudogene).
○ Type 2: Cases where genes were disabled due to accumulated mutations.
7. Proteomics
⑴ Overview
① Definition: The study of the expression patterns of translated proteins.
② Targets over a million proteins.
③ Transcriptomics explains only about 40% of actual proteomics: mRNA abundance accounts for only about 40% of the variation in protein abundance.
Figure 1. mRNA abundance vs. protein abundance in NIH3T3 cells
④ Advantages: Detects biomarkers closely related to physiological phenomena.
⑤ Disadvantages: Lower detection sensitivity than DNA- and RNA-based assays.
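If "explains about 40%" is read as a coefficient of determination (R² ≈ 0.4) between mRNA and protein abundance, it can be computed as the squared Pearson correlation of log-transformed abundances. The sketch below shows the calculation on made-up numbers; the data are purely illustrative and are not taken from Figure 1.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def r_squared(x, y):
    """Share of variance in y explained by a linear fit on x."""
    return pearson_r(x, y) ** 2


# Made-up log10 abundances for six hypothetical genes (not real measurements).
log_mrna = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
log_protein = [3.9, 3.0, 4.8, 4.1, 5.6, 4.4]

print(f"R^2 = {r_squared(log_mrna, log_protein):.2f}")
```

The unexplained share is usually attributed to differences in translation rate and protein turnover between genes.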
⑵ Subfields
① Protein expression: Cytokine array, etc.
② PTM (post-translational modification)
③ Structural proteomics
○ Protein’s quaternary structure (i.e., multiple polypeptides composing a protein).
○ Amino acids that are far apart in primary structure may be close in reality.
○ Example: In trypsinogen, the His, Asp, and Ser residues that form the catalytic triad are distant in the primary structure but come together to form a single active site.
○ Generally, to analyze protein sequences, peptidases (proteases) are used to break them into fragments of a certain length or shorter.
④ Phospho-proteomics
⑤ Glycomics
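The protease-digestion step mentioned under structural proteomics can be illustrated with an in-silico digest. The sketch assumes the commonly used trypsin rule (cleave on the C-terminal side of K or R unless the next residue is P); the function name and the peptide are hypothetical, and missed cleavages are ignored.

```python
import re

# Zero-width split point: preceded by K or R, and not followed by P.
TRYPSIN_SITE = re.compile(r"(?<=[KR])(?!P)")


def trypsin_digest(protein: str) -> list[str]:
    """Split a protein sequence into fully cleaved tryptic peptides."""
    fragments = TRYPSIN_SITE.split(protein.upper())
    return [f for f in fragments if f]  # drop the empty tail after a final K/R


peptides = trypsin_digest("AAKGGRPCCKDD")
print(peptides)
print([len(p) for p in peptides])
```

Fragments of this size are short enough to be sequenced or measured individually and then reassembled into the full protein sequence.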
8. Metabolomics
⑴ Metabolite profiling: Carried out in serum, plasma, urine, CSF, etc.
⑵ Tandem mass spec
9. Pharmacomics
⑴ Overview: Utilizes high-throughput screening technology.
⑵ Affymetrix GeneChip: HG-U133 Plus 2.0 Array, etc.
⑶ Luminex bead arrays (L1000)
⑷ Illumina Human HT-12 v4 Expression BeadChip Array
⑸ mRNA-seq (Illumina HiSeq)
⑹ GCP: Histone profiling
⑺ P100: Phosphoproteomics
⑻ KINOMEscan
⑼ KiNativ
⑽ MEMA
⑾ ELISA
⑿ RPPA
⒀ ATAC-seq
⒁ Cellarium
⒂ SWATH-MS
10. Phenomics
⑴ Cancer
⑵ Metabolic syndrome
⑶ Psychiatric disease
11. Radiomics
⑴ Definition: Fusion of nuclear medicine imaging and genomic information.
Input: 2021.06.12 13:56
Modified: 2022.03.17 13:44