Proteomics Analysis Pipeline
Recommended post: 【Bioinformatics】 Bioinformatics Analysis Table of Contents
1. QC
2. Motif Analysis
3. Protein-Protein Interaction
4. Prediction of Protein Variant Function
5. Metabolomics Analysis
6. Database Utilization
a. Transcriptomics Analysis Pipeline
b. Collection of Python Functions for Organic Chemistry
1. QC
⑴ Data Import: ThermoRawFileParser, msconvert
⑵ Mass spectrometry QC: PTXQC, rawtools
⑶ Peptide and protein identification: MaxQuant, MSFragger, Comet
⑷ Label-free or isobaric quantification: DIA-NN, Skyline, FlashLFQ
2. Motif Analysis
⑴ Overview: Includes epitope and pocket analysis.
⑵ Sequence Logo
① A graphical representation of amino acid or nucleotide multiple sequence alignment
② Developed by Tom Schneider and Mike Stephens
③ The y-axis represents information content as defined in information theory: information content = maximum entropy - actual entropy, in bits
④ Example 1. When all four nucleotides (A, T, G, C) occur at the same frequency: maximum entropy = 2 bits, actual entropy = 2 bits, information content = 0 bits
⑤ Example 2. When only one nucleotide appears: maximum entropy = 2 bits, actual entropy = 0 bits, information content = 2 bits
⑥ Example 3. When two nucleotides appear at the same frequency: maximum entropy = 2 bits, actual entropy = 1 bit, information content = 1 bit
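The three examples above can be reproduced with a short calculation. The following is a minimal sketch, assuming a single alignment column given as a string and no small-sample correction (published sequence-logo tools usually apply one); the function name `information_content` is chosen here for illustration only.

```python
import math
from collections import Counter

def information_content(column, alphabet_size=4):
    """Information content (bits) of one alignment column.

    Computed as maximum entropy minus observed entropy.
    alphabet_size is 4 for nucleotides and 20 for amino acids.
    """
    counts = Counter(column)
    total = len(column)
    # Shannon entropy of the observed residue frequencies
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy

print(information_content("ATGC"))  # Example 1: 0.0
print(information_content("AAAA"))  # Example 2: 2.0
print(information_content("AATT"))  # Example 3: 1.0
```

In a logo, each letter at a position is then drawn with a height proportional to its frequency multiplied by this information content.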
⑶ PROSITE
① A database of protein patterns
② Patterns are defined using regular expressions as follows:
○ The one-letter amino acid code is used when the residue at a position is known
○ Positions are separated by ‘-’
○ ‘x’ is a wildcard character
○ ‘[]’ represents ambiguity, i.e., [one of]
○ ‘{}’ represents negation, i.e., {not one of}
○ ‘()’ denotes repetition of the preceding element, i.e., (n) or (min,max)
○ ‘<’ or ‘>’ indicates the N-terminus or C-terminus of a protein, respectively
③ Examples
○ [AC]-x-V-x(4)-{ED} : [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
○ <A-x-[ST](2)-x(0,1)-V : N-terminal Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val
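Because the PROSITE syntax maps almost one-to-one onto regular expressions, the rules above can be sketched as a small converter. This is a simplified illustration, not an official PROSITE parser: the helper name `prosite_to_regex` is hypothetical, and edge cases such as anchors written inside square brackets are not handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    regex = pattern.rstrip(".")                          # optional trailing period
    regex = regex.replace("-", "")                       # '-' only separates positions
    regex = regex.replace("x", ".")                      # 'x' is the wildcard
    regex = regex.replace("{", "[^").replace("}", "]")   # {..} means "none of"
    regex = regex.replace("(", "{").replace(")", "}")    # (n) or (min,max) repetition
    regex = regex.replace("<", "^").replace(">", "$")    # N- and C-terminus anchors
    return regex

print(prosite_to_regex("[AC]-x-V-x(4)-{ED}"))     # [AC].V.{4}[^ED]
print(prosite_to_regex("<A-x-[ST](2)-x(0,1)-V"))  # ^A.[ST]{2}.{0,1}V

# Scan a made-up sequence for the first pattern
match = re.search(prosite_to_regex("[AC]-x-V-x(4)-{ED}"), "MKAGVLLLLKP")
print(match.group())  # AGVLLLLK
```

The negation step runs before the repetition step on purpose, so that the curly braces introduced for repetition are not mistaken for PROSITE negation.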
⑷ Membrane topology analysis (e.g., transmembrane regions)
② Tools: TMHMM, TopGraph, Phobius
⑸ PTM (post-translational modification)
② Tools: MusiteDeep, PTMProphet, MaxQuant
○ MusiteDeep (predicts 13 PTM types): Hydroxylysine, Hydroxyproline, Methylarginine, Methyllysine, N-linked_glycosylation, N6-acetyllysine, O-linked_glycosylation, Phosphoserine_Phosphothreonine, Phosphotyrosine, Pyrrolidone_carboxylic_acid, S-palmitoyl_cysteine, SUMOylation, Ubiquitination
⑹ Differential abundance analysis
① MSstats
② limma
⑺ Protein pathway enrichment
① Perseus
② clusterProfiler
⑻ Pocket and epitope discovery
① fpocket
② PUResNet
③ Kalasanty
⑤ Site4Drug
3. Protein-Protein Interaction (PPI; Molecular Docking)
⑴ Key Points
① Binding affinity (BA) is generally quantified by the dissociation constant (Kd) or inhibition constant (Ki)
② General considerations in PPI
○ General characteristics (e.g., atom types)
○ Physicochemical properties (e.g., excluded volume, partial charge, heavy atom neighbors, heteroatom neighbors, hybridization)
○ Pharmacological properties (e.g., hydrophobicity, aromaticity, acid/base, ring formation)
③ Datasets
○ PDBbind database, version 2016
○ Subset 1. General set : Includes all data, i.e., 13,285 protein-ligand complexes
○ Subset 2. Refined set : A subset of the general set, containing 4,057 high-quality complexes
○ Subset 3. Core 2016 set : 290 complexes extracted from the refined set, frequently used as benchmarking data
○ CSAR-HiQ
○ CSAR-HiQ_51 : A subset extracted from an original set of 176 protein-ligand complexes
○ CSAR-HiQ_36 : A subset extracted from an original set of 167 protein-ligand complexes
○ Biolip
○ InterPepScore
④ While there are several models for protein-ligand interactions, models for protein-protein interactions remain relatively scarce
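Affinity values such as Kd span many orders of magnitude, so they are commonly handled on a logarithmic scale (for example as -log10 Kd) or converted to a binding free energy. The sketch below assumes Kd is given in mol/L and uses the standard relation ΔG = RT ln(Kd / 1 M); the function names are illustrative.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def pkd(kd_molar):
    """Negative base-10 logarithm of Kd (Kd in mol/L)."""
    return -math.log10(kd_molar)

def delta_g(kd_molar, temperature_k=298.15):
    """Standard binding free energy in kcal/mol, dG = RT ln(Kd / 1 M)."""
    return R_KCAL * temperature_k * math.log(kd_molar)

kd = 1e-9  # a 1 nM binder
print(round(pkd(kd), 2))      # 9.0
print(round(delta_g(kd), 1))  # about -12.3 kcal/mol at 25 °C
```

A tenfold change in Kd shifts pKd by one unit and ΔG by roughly 1.4 kcal/mol at room temperature, which is why regression models are usually trained on the logarithmic value rather than on raw Kd.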
⑵ Models
① Overview
○ Divided into binding site prediction models and binding affinity prediction models, though the distinction is not strict
○ Generally, a binding distance of 3 Å or less between a ligand and receptor is considered strong binding
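A distance criterion like the 3 Å rule above can be checked directly on atomic coordinates. The following is a minimal sketch, assuming receptor and ligand atoms are supplied as NumPy arrays of shape (n_atoms, 3) in Å; the coordinates shown are made up, and real structures would first need to be parsed from a PDB or mmCIF file.

```python
import numpy as np

def contacts_within(receptor_xyz, ligand_xyz, cutoff=3.0):
    """Return (receptor_index, ligand_index) pairs no farther apart than cutoff (Å)."""
    # Pairwise difference vectors via broadcasting: shape (n_receptor, n_ligand, 3)
    diff = receptor_xyz[:, None, :] - ligand_xyz[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return np.argwhere(dist <= cutoff)

# Toy coordinates (hypothetical)
receptor = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
ligand = np.array([[2.5, 0.0, 0.0], [20.0, 0.0, 0.0]])
print(contacts_within(receptor, ligand))  # [[0 0]]
```

For large complexes the full pairwise matrix becomes expensive, and a spatial index such as a k-d tree would be the more usual choice.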
② Type 1. AlphaFold2 multimer, AFM-LIS, AlphaFold3
③ Type 2. DeepDTA
④ Type 3. DeepDTAF
⑤ Type 4. DeepFusionDTA
⑥ Type 5. GraphDTA
⑦ Type 6. CAPLA
⑧ Type 7. GNINA
○ Uses CNN for both binding site prediction and affinity evaluation
⑨ Type 8. SMINA
○ Uses physics-based scoring functions for both binding site prediction and affinity evaluation
⑩ Type 9. GLIDE
○ Uses physics-based scoring functions for both binding site prediction and affinity evaluation
⑪ Type 10. EquiBind
○ GNN with SE(3) equivariance
⑫ Type 11. TANKBind
○ Uses the attention mechanism of Transformers
⑬ Type 12. DiffDock
○ Uses a diffusion model
⑭ Type 13. membranefold
○ Imposes membrane-attachment conditions on AlphaFold
⑮ Type 14. Boltz-2, BoltzGen, Boltz Lab
○ Relatively free from the time–accuracy trade-off
○ BoltzGen, which designs binders based on Boltz-2, was recently announced
Figure 1. Boltz-2 Benchmarking Study
○ Boltz Lab

Figure 2. Boltz-2 User review
⑯ Type 15. DrugCLIP: Because screening each compound after structure prediction is slow, it first embeds the drug and the pocket in a shared space, then screens by similarity search, much like a search engine.
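The search-engine-like screening step can be illustrated with plain cosine similarity. This is only a conceptual sketch and not DrugCLIP's code: the embeddings below are made-up placeholder vectors, whereas in practice they would come from trained pocket and molecule encoders.

```python
import numpy as np

def top_k_by_cosine(pocket_vec, molecule_vecs, k=2):
    """Rank library molecules by cosine similarity to one pocket embedding."""
    pocket = pocket_vec / np.linalg.norm(pocket_vec)
    mols = molecule_vecs / np.linalg.norm(molecule_vecs, axis=1, keepdims=True)
    scores = mols @ pocket              # one similarity score per molecule
    order = np.argsort(-scores)[:k]     # indices of the k highest scores
    return order, scores[order]

# Placeholder embeddings (hypothetical, 2-dimensional for readability)
pocket = np.array([1.0, 0.0])
library = np.array([[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])
idx, sims = top_k_by_cosine(pocket, library)
print(idx)  # [1 2]
```

Since the library embeddings can be computed once and reused, each new pocket costs only one matrix-vector product, which is where the speed advantage over per-compound docking comes from.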
⑰ Type 16. BindCLIP
⑱ Type 17. Chai
⑲ Type 18. xQuest
4. Prediction of Protein Variant Function
⑴ PolyPhen-2 (Adzhubei et al., 2013)
⑵ SIFT (Kumar et al., 2009)
⑶ MutationTaster (Schwarz et al., 2014)
⑷ MutationAssessor (Reva et al., 2011)
⑸ LR and LRT (Chun & Fay, 2009)
5. Metabolomics Analysis
⑴ Peak detection: scipy.signal.find_peaks, peak widths
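As a small illustration of this step, the sketch below runs `scipy.signal.find_peaks` and `scipy.signal.peak_widths` on a synthetic trace made of two Gaussian peaks. The trace and the `height` and `prominence` thresholds are arbitrary choices for the example; real chromatograms need thresholds tuned to their noise level.

```python
import numpy as np
from scipy.signal import find_peaks, peak_widths

# Synthetic trace: two Gaussian peaks centred near x = 3 and x = 7
x = np.linspace(0, 10, 1001)
signal = np.exp(-((x - 3) ** 2) / 0.02) + 0.5 * np.exp(-((x - 7) ** 2) / 0.08)

# Keep local maxima that are tall and prominent enough
peaks, properties = find_peaks(signal, height=0.1, prominence=0.05)

# Width at half prominence, returned in units of samples
widths, width_heights, left_ips, right_ips = peak_widths(signal, peaks, rel_height=0.5)
fwhm = widths * (x[1] - x[0])  # convert samples to x units

print(x[peaks])  # peak positions near 3 and 7
print(fwhm)      # roughly 0.24 and 0.47
```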
⑵ LC-MS/GC-MS peak picking, alignment, feature grouping: XCMS centWave
⑶ Normalization, scaling: Median, Quantile (Bolstad 2003), TIC, PQN (Dieterle 2006), Log2
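Of the methods listed, PQN (probabilistic quotient normalization) is the least self-explanatory, so here is a minimal sketch. It assumes a samples-by-features intensity matrix with strictly positive values and uses the per-feature median across samples as the reference profile; the function name and toy data are illustrative.

```python
import numpy as np

def pqn_normalize(intensities):
    """Probabilistic quotient normalization (rows = samples, columns = features)."""
    X = np.asarray(intensities, dtype=float)
    reference = np.median(X, axis=0)                         # reference profile
    quotients = X / reference                                # feature-wise quotients
    dilution = np.median(quotients, axis=1, keepdims=True)   # one factor per sample
    return X / dilution

# Toy data: the same profile measured at three dilutions
base = np.array([10.0, 20.0, 30.0, 40.0])
data = np.vstack([base, 2 * base, 4 * base])
normalized = pqn_normalize(data)
print(normalized)  # every row becomes [20. 40. 60. 80.]
```

A log2 transform is typically applied after this kind of scaling rather than before it, since the quotients are defined on the raw intensity scale.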
⑷ Metabolite annotation: HMDB m/z matching, [M+H]⁺/[M-H]⁻/[M+Na]⁺ adducts
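The m/z matching can be sketched as follows. The adduct mass shifts are standard values (proton mass and the sodium cation mass), while the one-entry candidate table and the 10 ppm tolerance are stand-ins chosen for illustration; an actual workflow would look masses up in HMDB.

```python
PROTON = 1.007276  # mass of a proton in Da
ADDUCT_SHIFT = {
    "[M+H]+": PROTON,
    "[M-H]-": -PROTON,
    "[M+Na]+": 22.989218,  # sodium cation
}

def adduct_mz(monoisotopic_mass):
    """Expected m/z of each singly charged adduct for a neutral mass."""
    return {name: monoisotopic_mass + shift for name, shift in ADDUCT_SHIFT.items()}

def match(observed_mz, candidates, ppm=10.0):
    """Return (compound, adduct, ppm error) tuples within the ppm tolerance."""
    hits = []
    for name, mass in candidates.items():
        for adduct, mz in adduct_mz(mass).items():
            error_ppm = (observed_mz - mz) / mz * 1e6
            if abs(error_ppm) <= ppm:
                hits.append((name, adduct, round(error_ppm, 2)))
    return hits

# Hypothetical candidate table: glucose, monoisotopic mass 180.06339 Da
candidates = {"glucose": 180.06339}
hits = match(203.0526, candidates)
print(hits)  # one hit: glucose as [M+Na]+
```

Mass matching alone cannot distinguish isomers, so hits of this kind are best treated as putative annotations.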
⑸ Feature quantification, imputation, normalization: Min/2, median, KNN imputation (sklearn); TIC/median/log norm
⑹ Univariate statistical testing with FDR correction: Welch’s t-test, Wilcoxon, ANOVA, Kruskal-Wallis + BH FDR
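A compact sketch of this step: Welch's t-test per feature with `scipy.stats.ttest_ind(..., equal_var=False)`, followed by a hand-written Benjamini-Hochberg adjustment. The simulated data and the helper name `benjamini_hochberg` are assumptions for the example.

```python
import numpy as np
from scipy import stats

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0, 1)
    return adjusted

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))  # ≈ [0.04, 0.0533, 0.0533, 0.2]

# Simulated data: 6 samples per group, 50 features, the first 5 truly shifted
rng = np.random.default_rng(0)
control = rng.normal(10, 1, size=(6, 50))
treated = rng.normal(10, 1, size=(6, 50))
treated[:, :5] += 3
t_stat, p_values = stats.ttest_ind(control, treated, axis=0, equal_var=False)
q_values = benjamini_hochberg(p_values)
print(np.flatnonzero(q_values < 0.05))  # indices of features called significant
```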
⑺ Differential metabolite analysis with PCA: Welch’s t-test + BH FDR, PCA visualization
⑻ Pathway enrichment: hypergeometric test (ORA), KEGG pathways, BH FDR
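The over-representation test reduces to one upper-tail hypergeometric probability per pathway. The sketch below uses `scipy.stats.hypergeom.sf`; the function name, argument names, and numbers are illustrative, and the resulting p-values would still need BH correction across all tested pathways.

```python
from scipy.stats import hypergeom

def ora_pvalue(n_background, n_pathway, n_hits, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric null.

    n_background: all measured metabolites
    n_pathway:    background metabolites annotated to the pathway
    n_hits:       significant metabolites
    n_overlap:    significant metabolites that fall in the pathway
    """
    # sf(k) gives P(X > k), so pass n_overlap - 1 to include n_overlap itself
    return hypergeom.sf(n_overlap - 1, n_background, n_pathway, n_hits)

# Made-up example: 8 of 40 hits fall in a 30-member pathway, background of 500
p = ora_pvalue(n_background=500, n_pathway=30, n_hits=40, n_overlap=8)
print(p)
```

The choice of background matters: using only the metabolites actually detected in the experiment, rather than the whole database, avoids inflating the enrichment.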
6. Database Utilization
① Integrated Small Molecule Database: Database providing data on the physiological activity of about 800,000 small molecules in vector format
② AlphaFold Protein Structure Database: Predicted structures for more than 200 million proteins
③ Ensembl: Genome annotation database (genes and transcripts)
④ UniProt: Protein database
⑤ The Human Protein Atlas: Public access resource aiming to map all human proteins in cells, tissues, and organs
⑥ SGC (Chemical Probes): Provides a unique probe collection along with related data, control compounds, and usage recommendations
⑵ Antigen-antibody Database
① IEDB (Immune Epitope Database)
② VDJdb
③ BciPep
④ SAbDab
⑤ IMGT/3Dstructure-DB
⑥ AACDB (Antigen-Antibody Complex Database)
⑦ Thera-SAbDab
⑧ Cov-AbDab
⑨ Abcam: Commercial
⑩ CST: Commercial
⑪ CiteAb: Commercial
⑫ Antibodypedia: Commercial
⑬ ABCD database
① NCBI dbSNP
② gnomAD
③ PharmVar
④ PharmGKB
⑤ NCBI PubChem
⑥ Broad Institute CMAP
⑦ CTD
⑧ CompTox
⑨ DrugBank
⑩ STITCH (Search Tool for Interactions of Chemicals)
⑪ ToppFun
⑫ DepMap: Provides expression data and lineage information for the corresponding cell line.
⑬ L1000CDS2
⑭ L1000FWD
⑮ GDSC (Genomics of Drug Sensitivity in Cancer)
⑯ CCLE
⑰ ClinicalTrials.gov: Provides information on clinical trial progress for each drug
⑱ Cortellis: Provides information on the clinical trial progress of each drug.
⑲ The Antibody Society: Provides information on the clinical trial progress of antibodies.
⑳ PRISM: Provides large-scale drug response data across hundreds of cancer cell lines.
Input: 2024.03.31 01:08
Modified: 2024.09.29 15:40