Proteomics Analysis Pipeline
Recommended post: 【Bioinformatics】 Bioinformatics Analysis Table of Contents
1. QC
2. Motif Analysis
3. Protein-Protein Interaction
4. Prediction of Protein Variant Function
5. Metabolomics Analysis
6. Database Utilization
a. Transcriptomics Analysis Pipeline
b. Collection of Python Functions for Organic Chemistry
1. QC
⑴ Data Import: ThermoRawFileParser, msconvert
⑵ Mass spectrometry QC: PTXQC, rawtools
⑶ Peptide and protein ID: MaxQuant, MSFragger, Comet
⑷ Label-free or isobaric quant: DIA-NN, Skyline, FlashLFQ
2. Motif Analysis
⑴ Overview: Includes epitope and pocket analysis.
⑵ Sequence Logo
① A graphical representation of amino acid or nucleotide multiple sequence alignment
② Developed by Tom Schneider and Mike Stephens
③ The y-axis represents information content as defined in information theory: information content = maximum entropy - actual entropy, measured in bits
④ Example 1. When all four nucleotides (A, T, G, C) occur at the same frequency: maximum entropy = 2 bits, actual entropy = 2 bits, information content = 0 bits
⑤ Example 2. When only one nucleotide appears: maximum entropy = 2 bits, actual entropy = 0 bits, information content = 2 bits
⑥ Example 3. When two nucleotides appear at the same frequency: maximum entropy = 2 bits, actual entropy = 1 bit, information content = 1 bit
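The three examples can be reproduced with a short calculation. The following is a minimal Python sketch, assuming plain Shannon entropy over the observed frequencies of one alignment column and no small-sample correction; the function name `information_content` is illustrative and does not come from any sequence logo library.

```python
import math
from collections import Counter

def information_content(column, alphabet_size=4):
    """Return (maximum entropy, actual entropy, information content) in bits
    for one alignment column given as a string of symbols."""
    counts = Counter(column)
    total = sum(counts.values())
    # Shannon entropy of the observed symbol frequencies.
    entropy = sum(-(n / total) * math.log2(n / total) for n in counts.values())
    max_entropy = math.log2(alphabet_size)
    return max_entropy, entropy, max_entropy - entropy

# One column per example: uniform, single nucleotide, two nucleotides.
for column in ("ATGC", "AAAA", "AATT"):
    print(column, information_content(column))
```

For amino acid alignments, `alphabet_size=20` would raise the maximum entropy to log2(20), about 4.32 bits.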
⑶ PROSITE
① A database of protein patterns
② Patterns are defined using regular expressions as follows:
○ Standard one-letter amino acid codes are used where the residue is known
○ Positions are separated by ‘-’
○ ‘x’ is a wildcard that matches any amino acid
○ ‘[]’ represents ambiguity, i.e., [one of]
○ ‘{}’ represents negation, i.e., {not one of}
○ ‘()’ denotes repetition: a fixed count, e.g., x(4), or a range (min,max), e.g., x(2,4)
○ ‘<’ or ‘>’ anchors the pattern to the N-terminus or C-terminus of a protein, respectively
③ Examples
○ [AC]-x-V-x(4)-{ED} : [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
○ <A-x-[ST](2)-x(0,1)-V : N-terminal Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val
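Because PROSITE syntax maps closely onto regular expressions, a pattern can be scanned against a sequence after a simple translation. The sketch below is one possible converter, not PROSITE's own tooling: the function name `prosite_to_regex` is hypothetical, and it only handles the elements listed above (it does not cover anchors placed inside square brackets).

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        at_start = element.startswith("<")   # N-terminal anchor
        at_end = element.endswith(">")       # C-terminal anchor
        element = element.strip("<>")
        # Split off an optional repetition suffix such as (4) or (0,1).
        residue, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if residue == "x":
            residue = "."                          # wildcard
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"   # negation
        # '[...]' and single letters are already valid regex.
        if low is not None:
            residue += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if at_start else "") + residue + ("$" if at_end else ""))
    return "".join(parts)

print(prosite_to_regex("[AC]-x-V-x(4)-{ED}"))
print(prosite_to_regex("<A-x-[ST](2)-x(0,1)-V"))
```

The returned string can be passed to `re.search` or `re.finditer` to locate motif occurrences in a protein sequence.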
⑷ Membrane topology analysis (e.g., transmembrane regions)
② Tools: TMHMM, TopGraph, Phobius
⑸ PTM
② Tools: MusiteDeep, PTMProphet, MaxQuant
○ MusiteDeep (detecting 13 different PTM patterns): Hydroxylysine, Hydroxyproline, Methylarginine, Methyllysine, N-linked_glycosylation, N6-acetyllysine, O-linked_glycosylation, Phosphoserine_Phosphothreonine, Phosphotyrosine, Pyrrolidone_carboxylic_acid, S-palmitoyl_cysteine, SUMOylation, Ubiquitination
⑹ Differential abundance analysis
① MSstats
② limma
⑺ Protein pathway enrichment
① Perseus
② clusterProfiler
⑻ Pocket, Epitope discovery
① fpocket
② PUResNet
③ Kalasanty
⑤ Latent-Y
3. Protein-Protein Interaction (PPI; Molecular Docking)
⑴ Key Points
① Binding affinity (BA) is generally quantified by the dissociation constant (Kd) or inhibition constant (Ki)
② General considerations in PPI
○ General characteristics (e.g., atom types)
○ Physicochemical properties (e.g., excluded volume, partial charge, heavy atom neighbors, heteroatom neighbors, hybridization)
○ Pharmacological properties (e.g., hydrophobicity, aromaticity, acid/base, ring formation)
③ Datasets
○ PDBbind database (version 2016)
○ Subset 1. General set : Includes all data, i.e., 13,285 protein-ligand complexes
○ Subset 2. Refined set : A subset of the general set, containing 4,057 high-quality complexes
○ Subset 3. Core 2016 set : 290 complexes extracted from the refined set, frequently used as benchmarking data
○ CSAR-HiQ
○ CSAR-HiQ_51 : A subset extracted from an original set of 176 protein-ligand complexes
○ CSAR-HiQ_36 : A subset extracted from an original set of 167 protein-ligand complexes
○ Biolip
○ InterPepScore
④ While there are several models for protein-ligand interactions, models for protein-protein interactions remain relatively scarce
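Binding affinity datasets usually report Kd or Ki on a negative log scale, and Kd is related to the standard binding free energy by ΔG = RT ln(Kd). The sketch below shows both conversions; it assumes Kd is given in mol/L relative to a 1 M standard state, and the function names are illustrative.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def pkd(kd_molar):
    """Negative base-10 logarithm of Kd (Kd in mol/L)."""
    return -math.log10(kd_molar)

def binding_free_energy(kd_molar, temperature_k=298.15):
    """Standard binding free energy in kcal/mol from Kd: dG = RT ln(Kd)."""
    return R_KCAL * temperature_k * math.log(kd_molar)

# A 1 nM binder: pKd of 9 and a binding free energy near -12.3 kcal/mol.
print(pkd(1e-9), binding_free_energy(1e-9))
```

A tenfold tighter Kd shifts pKd by one unit and ΔG by about 1.36 kcal/mol at room temperature.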
⑵ Metrics
① RMSD (root-mean-square deviation): Structure similarity metric. Values < 2.0 Å indicate essentially identical backbone geometry, while 2-4 Å indicates similar topology.
② TM-score (template modeling score): Structure similarity metric. Values ≥ 0.5 indicate essentially the same fold, while values < 0.5 indicate different folds or topologies.
③ pLDDT (predicted local distance difference test): Per-residue structure confidence metric
④ Local Interaction Score (LIS)
○ LIS and the Local Interaction Area (LIA) are calculated from ColabFold output JSON files. Amino acid contacts whose predicted aligned error (PAE) falls within the cutoff define the LIA. Inverted PAE values within the LIA are averaged to give the LIS. The ‘best’ LIS is taken from the rank 1 model, while the ‘average’ LIS is computed over ranks 1-5. A PAE cutoff of 12 Å, which gave the highest AUC in ROC analysis, is used for both the average and best LIS.
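The LIS calculation can be sketched as follows. This is a minimal illustration, not the reference implementation: it assumes the PAE matrix has already been loaded into a list of lists, that "inverted" means rescaling PAE to 1 - PAE/cutoff, and that both inter-chain directions are pooled. The function names and the toy matrix are hypothetical.

```python
def local_interaction(pae, chain_a, chain_b, cutoff=12.0):
    """Return (LIS, LIA) for two chains.

    pae      -- square PAE matrix as a list of lists (values in Angstrom)
    chain_a  -- iterable of residue indices belonging to the first chain
    chain_b  -- iterable of residue indices belonging to the second chain
    """
    inverted = []
    # PAE is not symmetric, so both inter-chain blocks are scanned.
    for rows, cols in ((chain_a, chain_b), (chain_b, chain_a)):
        for i in rows:
            for j in cols:
                if pae[i][j] < cutoff:
                    # Assumed inversion: low error maps to a score near 1.
                    inverted.append(1.0 - pae[i][j] / cutoff)
    lia = len(inverted)                      # number of confident contacts
    lis = sum(inverted) / lia if lia else 0.0
    return lis, lia

def best_and_average_lis(lis_by_rank):
    """Best LIS (rank 1 model) and average LIS over up to five ranked models."""
    top = lis_by_rank[:5]
    return lis_by_rank[0], sum(top) / len(top)

# Toy 4-residue complex: residues 0-1 form chain A, residues 2-3 form chain B.
toy_pae = [
    [0.5, 1.0, 3.0, 20.0],
    [1.0, 0.5, 6.0, 25.0],
    [3.0, 9.0, 0.5, 1.0],
    [30.0, 28.0, 1.0, 0.5],
]
print(local_interaction(toy_pae, range(0, 2), range(2, 4)))
```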
⑤ Tanimoto Similarity
○ Molecules are clustered by Tanimoto similarity computed on standard molecular fingerprints.
○ Pairs with Tanimoto ≥ 0.7-0.8 are highly similar analogs and should remain in the same data split.
○ Pairs with Tanimoto ≥ 0.5-0.6 are moderately similar.
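To make the split rule concrete, the sketch below computes Tanimoto similarity on fingerprints represented as sets of "on" bit indices and groups molecules by single linkage, so that any pair above the threshold ends up in the same group. The set representation, the function names, and the 0.7 threshold are assumptions for illustration; in practice the fingerprints would come from a cheminformatics toolkit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def similarity_groups(fingerprints, threshold=0.7):
    """Single-linkage grouping: return one group label per molecule."""
    parent = list(range(len(fingerprints)))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(fingerprints)):
        for j in range(i + 1, len(fingerprints)):
            if tanimoto(fingerprints[i], fingerprints[j]) >= threshold:
                parent[find(j)] = find(i)   # merge the two groups
    return [find(i) for i in range(len(fingerprints))]

# Toy fingerprints: the first two molecules share 4 of 5 bits (similarity 0.8).
toy_fps = [{1, 2, 3, 4}, {1, 2, 3, 4, 5}, {10, 11, 12}]
print(similarity_groups(toy_fps))
```

Assigning whole groups, rather than individual molecules, to train or test keeps close analogs from leaking across the split.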
⑥ Random Neighbor Score (RNS): Model-agnostic
⑶ Models
① Overview
○ Divided into binding site prediction models and binding affinity prediction models, though the distinction is not strict
○ Generally, a binding distance of 3 Å or less between a ligand and receptor is considered strong binding
② Type 1. AlphaFold2 multimer, AFM-LIS, AlphaFold3
③ Type 2. DeepDTA
④ Type 3. DeepDTAF
⑤ Type 4. DeepFusionDTA
⑥ Type 5. GraphDTA
⑦ Type 6. CAPLA
⑧ Type 7. GNINA
○ Uses CNN for both binding site prediction and affinity evaluation
⑨ Type 8. SMINA
○ Uses physics-based scoring functions for both binding site prediction and affinity evaluation
⑩ Type 9. GLIDE
○ Uses physics-based scoring functions for both binding site prediction and affinity evaluation
⑪ Type 10. EquiBind
○ GNN with SE(3) equivariance
⑫ Type 11. TANKBind
○ Uses the attention mechanism of Transformers
⑬ Type 12. DiffDock
○ Utilizes a diffusion model.
⑭ Type 13. membranefold
○ Imposes membrane-attachment conditions on AlphaFold
⑮ Type 14. Boltz-2, BoltzGen, Boltz Lab
○ Relatively free from the time–accuracy trade-off
○ BoltzGen, designed to generate binders based on Boltz-2, was recently announced
Figure 1. Boltz-2 Benchmarking Study
○ BoltzLab

Figure 2. Boltz-2 User review
⑯ Type 15. DrugCLIP: Because screening after structure prediction is slow, it first co-embeds drugs and pockets in a shared space, then screens in a search-engine-like manner.
⑰ Type 16. BindCLIP
⑱ Type 17. Chai, Chai2
⑲ Type 18. xQuest
⑳ Type 19. BindCraft : protein generation
㉑ Type 20. JAM-2 : protein generation
㉒ Type 21. Latent-X2 : protein generation
㉓ Type 22. IsoDDE : protein generation
㉔ Type 23. Bepler: A bidirectional LSTM model trained with contrastive learning on global and local protein structural similarity. It converts protein sequences into vectors that capture structural meaning.
4. Prediction of Protein Variant Function
⑴ PolyPhen-2 (Adzhubei et al., 2013)
⑵ SIFT (Kumar et al., 2009)
⑶ MutationTaster (Schwarz et al., 2014)
⑷ MutationAssessor (Reva et al., 2011)
⑸ LR and LRT (Chun & Fay, 2009)
5. Metabolomics Analysis
⑴ Peak detection: scipy.signal.find_peaks, peak widths
⑵ LC-MS/GC-MS peak picking, alignment, feature grouping: XCMS centWave
⑶ Normalization, scaling: Median, Quantile (Bolstad 2003), TIC, PQN (Dieterle 2006), Log2
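Of the normalization methods listed, PQN (probabilistic quotient normalization) is the least obvious, so a small sketch may help. It assumes the reference profile is the per-feature median across all samples and skips the integral (TIC) normalization that often precedes PQN; the function name and toy intensities are illustrative.

```python
from statistics import median

def pqn_normalize(samples):
    """Probabilistic quotient normalization.

    samples -- list of intensity lists, one per sample, same feature order.
    Returns a new list of normalized intensity lists.
    """
    n_features = len(samples[0])
    # Reference profile: median intensity of each feature across samples.
    reference = [median(s[k] for s in samples) for k in range(n_features)]
    normalized = []
    for s in samples:
        # Quotient of each feature against the reference (zeros are skipped).
        quotients = [x / r for x, r in zip(s, reference) if x > 0 and r > 0]
        dilution = median(quotients)   # most probable dilution factor
        normalized.append([x / dilution for x in s])
    return normalized

# Sample 2 is a twofold concentrated copy of sample 1; sample 3 is half as
# concentrated but has a genuinely elevated third feature.
toy = [[100, 200, 300], [200, 400, 600], [50, 100, 900]]
print(pqn_normalize(toy))
```

Because the dilution factor is a median of quotients, a few strongly changed features do not distort the scaling of the rest.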
⑷ Metabolite annotation: HMDB m/z matching, [M+H]⁺/[M-H]⁻/[M+Na]⁺ adducts
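Adduct-aware m/z matching can be sketched as below. The mass shifts are the proton mass and the mass of Na minus one electron; the two-entry candidate table is a hypothetical stand-in for an HMDB export, and the 10 ppm tolerance and function names are assumptions.

```python
# Mass shift (Da) added to the neutral monoisotopic mass for each adduct.
ADDUCT_SHIFTS = {
    "[M+H]+": 1.007276,
    "[M-H]-": -1.007276,
    "[M+Na]+": 22.989218,
}

def adduct_mz(mono_mass):
    """Expected m/z of each singly charged adduct of a neutral mass."""
    return {name: mono_mass + shift for name, shift in ADDUCT_SHIFTS.items()}

def match_mz(observed_mz, candidates, ppm=10.0):
    """Return (metabolite, adduct, error in ppm) for every candidate adduct
    whose expected m/z lies within the ppm tolerance of the observed m/z."""
    hits = []
    for name, mass in candidates.items():
        for adduct, mz in adduct_mz(mass).items():
            error_ppm = (observed_mz - mz) / mz * 1e6
            if abs(error_ppm) <= ppm:
                hits.append((name, adduct, round(error_ppm, 2)))
    return hits

# Hypothetical lookup table: neutral monoisotopic masses in Da.
toy_db = {"glucose": 180.06339, "citric acid": 192.02700}
print(match_mz(203.0526, toy_db))
```

Positive-mode and negative-mode data would normally be matched only against adducts of the corresponding polarity.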
⑸ Feature quantification, imputation, normalization: Min/2, median, KNN imputation (sklearn); TIC/median/log norm
⑹ Univariate statistical testing with FDR correction: Welch’s t-test, Wilcoxon, ANOVA, Kruskal-Wallis + BH FDR
⑺ Differential metabolite analysis with PCA: Welch’s t-test + BH FDR, PCA visualization
⑻ Pathway enrichment: hypergeometric test (ORA), KEGG pathways, BH FDR
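The over-representation test and the BH correction used in the steps above fit in a few lines of standard-library Python. This is a sketch under stated assumptions: the pathway sizes and hit counts are made-up numbers rather than KEGG data, and the function names are illustrative.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): probability of at least k pathway members among n selected
    metabolites, when K of the N background metabolites are in the pathway."""
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / comb(N, n)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical input: 1000 background metabolites, 20 significant ones.
pathways = {"pathway A": (10, 5), "pathway B": (30, 1), "pathway C": (50, 3)}
raw = [hypergeom_pvalue(k, 20, K, 1000) for K, k in pathways.values()]
for name, p, q in zip(pathways, raw, benjamini_hochberg(raw)):
    print(name, p, q)
```

The same `benjamini_hochberg` helper applies to the per-metabolite p-values from the univariate tests.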
6. Database Utilization
① Integrated Small Molecule Database: Database providing data on the physiological activity of about 800,000 small molecules in vector format
② AlphaFold Protein Structure Database: Predicted structures for over 200 million proteins
③ Ensembl: Genome annotation database (genes and transcripts)
④ UniProt: Protein sequence and annotation database
⑤ The Human Protein Atlas: Public access resource aiming to map all human proteins in cells, tissues, and organs
⑥ SGC (Chemical Probes): Provides a unique probe collection along with related data, control compounds, and usage recommendations
⑵ Antigen-antibody Database
① IEDB (Immune Epitope Database)
② VDJdb
③ BciPep
④ SAbDab
⑤ IMGT/3Dstructure-DB
⑥ AACDB (Antigen-Antibody Complex DB)
⑦ Thera-SAbDab
⑧ Cov-AbDab
⑨ Abcam: Commercial
⑩ CST: Commercial
⑪ CiteAb: Commercial
⑫ Antibodypedia: Commercial
⑬ ABCD database
① NCBI dbSNP
② gnomAD
③ PharmVar
④ PharmGKB
⑤ NCBI PubChem
⑥ Broad Institute CMAP
⑦ CTD
⑧ CompTox
⑨ DrugBank
⑩ STITCH (Search Tool for Interactions of Chemicals)
⑪ ToppFun
⑫ DepMap: Provides expression data and lineage information for the corresponding cell line.
⑬ L1000CDS2
⑭ L1000FWD
⑮ GDSC (Genomics of Drug Sensitivity in Cancer)
⑯ CCLE
⑰ ClinicalTrials.gov: Provides information on clinical trial progress for each drug
⑱ Cortellis: Provides information on the clinical trial progress of each drug.
⑲ The Antibody Society: Provides information on the clinical trial progress of antibodies.
⑳ PRISM: Provides large-scale drug response data across hundreds of cancer cell lines.
Input: 2024.03.31 01:08
Modified: 2024.09.29 15:40