
Proteomics Analysis Pipeline

Recommended post: 【Bioinformatics】 Bioinformatics Analysis Table of Contents


1. QC

2. Motif Analysis

3. Protein-Protein Interaction

4. Prediction of Protein Variant Function

5. Metabolomics Analysis

6. Database Utilization


a. Transcriptomics Analysis Pipeline

b. Collection of Python Functions for Organic Chemistry



1. QC

⑴ Data Import: ThermoRawFileParser, msconvert

⑵ Mass spectrometry QC: PTXQC, rawtools

⑶ Peptide and protein identification: MaxQuant, MSFragger, Comet

⑷ Label-free or isobaric quantification: DIA-NN, Skyline, FlashLFQ



2. Motif Analysis

⑴ Overview: Includes epitope and pocket analysis.

⑵ Sequence Logo

① A graphical representation of amino acid or nucleotide multiple sequence alignment

② Developed by Tom Schneider and Mike Stephens

③ The y-axis represents information content as defined in information theory: information content = maximum entropy - actual entropy, measured in bits

Example 1. When all four nucleotides (A, T, G, C) occur at the same frequency : Maximum entropy = 2 bits, Actual entropy = 2 bits, Information content = 0 bits

Example 2. When only one nucleotide appears : Maximum entropy = 2 bits, Actual entropy = 0 bits, Information content = 2 bits

Example 3. When two nucleotides appear at the same frequency : Maximum entropy = 2 bits, Actual entropy = 1 bit, Information content = 1 bit
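
The three examples above can be reproduced by subtracting the observed Shannon entropy of an alignment column from the maximum entropy. The following is a minimal Python sketch, assuming a four-letter nucleotide alphabet and ignoring the small-sample correction that sequence logo tools usually apply. The function name `information_content` is illustrative only.

```python
import math
from collections import Counter


def information_content(column, alphabet_size=4):
    """Return (max_entropy, actual_entropy, information) in bits for one alignment column.

    `column` is a string or list holding the residue seen at one position
    in every aligned sequence. No small-sample correction is applied.
    """
    counts = Counter(column)
    total = len(column)
    # Shannon entropy: sum of -p * log2(p) over the observed residues
    entropy = sum(-(n / total) * math.log2(n / total) for n in counts.values())
    max_entropy = math.log2(alphabet_size)
    return max_entropy, entropy, max_entropy - entropy


if __name__ == "__main__":
    for column in ("ATGC", "AAAA", "AATT"):
        print(column, information_content(column))
```

For proteins, the same function could be called with `alphabet_size=20`, which raises the maximum entropy to log2(20), about 4.32 bits.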

⑶ PROSITE

① A database of protein patterns

② Patterns are defined using regular expressions as follows:

○ Known residues are written with the standard one-letter amino acid codes

○ Positions are separated by ‘-’

○ ‘x’ is a wildcard character

○ ‘[]’ represents ambiguity, i.e., [one of]

○ ‘{}’ represents negation, i.e., {not one of}

○ ‘()’ denotes repetition, i.e., (n) for an exact count or (min,max) for a range

○ ‘<’ or ‘>’ indicates the N-terminus or C-terminus of a protein, respectively

③ Examples

○ [AC]-x-V-x(4)-{ED} : [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}

○ <A-x-[ST](2)-x(0,1)-V : N-terminal Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val
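
The syntax rules above map almost one-to-one onto ordinary regular expressions. The sketch below is a hedged illustration, not a complete PROSITE parser: it covers only the elements listed in this section, and the helper name `prosite_to_regex` is made up for this example.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    regex = pattern.rstrip(".")                         # drop an optional trailing period
    regex = regex.replace("-", "")                      # '-' only separates positions
    regex = regex.replace("x", ".")                     # 'x' is a wildcard
    regex = regex.replace("{", "[^").replace("}", "]")  # {..} means "not one of"
    regex = regex.replace("(", "{").replace(")", "}")   # (n) or (min,max) is repetition
    regex = regex.replace("<", "^").replace(">", "$")   # N-terminus / C-terminus anchors
    return regex


if __name__ == "__main__":
    for pattern in ("[AC]-x-V-x(4)-{ED}", "<A-x-[ST](2)-x(0,1)-V"):
        print(pattern, "->", prosite_to_regex(pattern))
    # Scan a toy sequence (hypothetical, for illustration only)
    print(bool(re.search(prosite_to_regex("[AC]-x-V-x(4)-{ED}"), "GGAKVLLLLKGG")))
```

The braces are converted before the parentheses because the parentheses become braces in the regular expression; doing it in the other order would turn repetition counts into negated character classes.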

⑷ Membrane topology analysis (e.g., transmembrane regions)

① Membrane proteins and the hydropathy index

② Tools: TMHMM, TopGraph, Phobius
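
To make the hydropathy index concrete, the sketch below computes a sliding-window hydropathy profile. It assumes the Kyte-Doolittle scale, a 19-residue window, and a threshold of 1.6, which are conventional choices for spotting candidate transmembrane helices; the text above does not specify a scale. The tools listed above use more elaborate models, so this is only an illustration of the underlying idea, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale; higher = more hydrophobic)
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_tm_windows(sequence, window=19, threshold=1.6):
    """Start indices (0-based) of windows whose mean hydropathy reaches the threshold."""
    profile = hydropathy_profile(sequence, window)
    return [i for i, value in enumerate(profile) if value >= threshold]


if __name__ == "__main__":
    # Toy sequence: a hydrophobic stretch flanked by charged residues
    toy = "DDDDDKKKKK" + "LIVLLAVILLGVALLIVAL" + "KKKKKDDDDD"
    print(candidate_tm_windows(toy))
```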

⑸ PTM

① Post-translational modification (PTM)

② Tools: MusiteDeep, PTMProphet, MaxQuant

○ MusiteDeep (predicts 13 PTM types): Hydroxylysine, Hydroxyproline, Methylarginine, Methyllysine, N-linked_glycosylation, N6-acetyllysine, O-linked_glycosylation, Phosphoserine_Phosphothreonine, Phosphotyrosine, Pyrrolidone_carboxylic_acid, S-palmitoyl_cysteine, SUMOylation, Ubiquitination

⑹ Differential abundance analysis

① MSstats

② limma

⑺ Protein pathway enrichment

① Perseus

② clusterProfiler

⑻ Pocket, Epitope discovery

① fpocket

② PUResNet

③ Kalasanty

④ RAPID-Net

⑤ Latent-Y



3. Protein-Protein Interaction (PPI; Molecular Docking)

⑴ Key Points

① Binding affinity (BA) is generally quantified by the dissociation constant (Kd) or inhibition constant (Ki)

② General considerations in PPI

○ General characteristics (e.g., atom types)

○ Physicochemical properties (e.g., excluded volume, partial charge, heavy atom neighbors, heteroatom neighbors, hybridization)

○ Pharmacological properties (e.g., hydrophobicity, aromaticity, acid/base, ring formation)

③ Datasets

○ PDBbind database, version 2016

Subset 1. General set : Includes all data, i.e., 13,285 protein-ligand complexes

Subset 2. Refined set : A subset of the general set, containing 4,057 high-quality complexes

Subset 3. Core 2016 set : 290 complexes extracted from the refined set, frequently used as benchmarking data

○ CASF-2013

○ CSAR-HiQ

CSAR-HiQ_51 : A subset extracted from an original set of 176 protein-ligand complexes

CSAR-HiQ_36 : A subset extracted from an original set of 167 protein-ligand complexes

○ Biolip

○ InterPepScore

④ While there are several models for protein-ligand interactions, models for protein-protein interactions remain relatively scarce
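
Since binding affinity is quantified by Kd or Ki, it is often convenient to convert these constants into a negative log scale (the form commonly used as a regression label) or into a binding free energy via ΔG = RT ln(Kd). The sketch below assumes Kd is given in mol/L with a 1 M standard state and a temperature of 298.15 K; the function names are illustrative.

```python
import math

R_KCAL_PER_MOL_K = 1.987204e-3  # gas constant in kcal/(mol*K)


def pkd(kd_molar):
    """Negative base-10 logarithm of a dissociation constant given in mol/L."""
    return -math.log10(kd_molar)


def binding_free_energy(kd_molar, temperature_k=298.15):
    """Binding free energy in kcal/mol from dG = R * T * ln(Kd), 1 M standard state."""
    return R_KCAL_PER_MOL_K * temperature_k * math.log(kd_molar)


if __name__ == "__main__":
    for kd in (1e-3, 1e-6, 1e-9):  # millimolar, micromolar, nanomolar binders
        print(f"Kd={kd:.0e} M  pKd={pkd(kd):.1f}  dG={binding_free_energy(kd):.2f} kcal/mol")
```

A tighter binder has a smaller Kd, a larger pKd, and a more negative ΔG; each tenfold change in Kd shifts ΔG by roughly 1.4 kcal/mol at room temperature.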

⑵ Metrics

① RMSD (root-mean-square deviation): Structure similarity metric. Values below 2.0 Å indicate effectively identical backbone geometry, while 2-4 Å indicates similar topology.

② TM-score (template modeling score): Structure similarity metric. A score ≥ 0.5 indicates essentially the same fold, while a score < 0.5 indicates a different fold or topology.

③ pLDDT (predicted local distance difference test): Per-residue structure confidence metric

④ Local Interaction Score (LIS)

○ LIS and the Local Interaction Area (LIA) are calculated from ColabFold output JSON files. Amino acid contacts whose PAE (predicted aligned error) falls within the cutoff define the LIA, and the inverted PAE values within the LIA are averaged to give the LIS. The ‘best’ LIS comes from the rank 1 model, while the ‘average’ LIS is computed over ranks 1-5. A PAE cutoff of 12, which gave the highest AUC in ROC analysis, is used for both the best and average LIS.

⑤ Tanimoto similarity

○ Used to cluster molecules based on standard molecular fingerprints.

○ Pairs with Tanimoto ≥ 0.7-0.8 are highly similar analogs and should remain in the same data split.

○ Pairs with Tanimoto ≥ 0.5-0.6 are moderately similar.

⑥ Random Neighbor Score (RNS): Model-agnostic
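
The Tanimoto-based clustering described above can be sketched as follows. This is a minimal illustration under stated assumptions: each fingerprint is represented as a set of on-bit indices (a stand-in for what a cheminformatics toolkit would produce), the 0.7 threshold is the lower bound quoted above, and greedy leader clustering is just one simple way to keep close analogs together. All names are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def leader_clusters(fingerprints, threshold=0.7):
    """Greedy leader clustering.

    Each molecule joins the first cluster whose leader (its first member)
    is at least `threshold` similar; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of molecule indices.
    """
    clusters = []
    for index, fingerprint in enumerate(fingerprints):
        for cluster in clusters:
            if tanimoto(fingerprints[cluster[0]], fingerprint) >= threshold:
                cluster.append(index)
                break
        else:
            clusters.append([index])
    return clusters


if __name__ == "__main__":
    # Toy fingerprints (made-up bit indices)
    fingerprints = [{1, 2, 3, 4}, {1, 2, 3, 4, 5}, {10, 11, 12}]
    print(leader_clusters(fingerprints))
```

Assigning whole clusters, rather than individual molecules, to the training or test split keeps highly similar analogs on the same side of the split.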

⑶ Models

① Overview

○ Divided into binding site prediction models and binding affinity prediction models, though the distinction is not strict

○ Generally, a binding distance of 3 Å or less between a ligand and receptor is considered strong binding

Type 1. AlphaFold2 multimer, AFM-LIS, AlphaFold3

Type 2. DeepDTA

Type 3. DeepDTAF

Type 4. DeepFusionDTA

Type 5. GraphDTA

Type 6. CAPLA

Type 7. GNINA

○ Uses CNN for both binding site prediction and affinity evaluation

Type 8. SMINA

○ Uses physics-based scoring functions for both binding site prediction and affinity evaluation

Type 9. GLIDE

○ Uses physics-based scoring functions for both binding site prediction and affinity evaluation

Type 10. EquiBind

○ GNN with SE(3) equivariance

Type 11. TANKBind

○ Uses the attention mechanism of Transformers

Type 12. DiffDock

○ Utilizes a diffusion model.

Type 13. membranefold

○ Imposes membrane-attachment conditions on AlphaFold

Type 14. Boltz-2, BoltzGen, Boltz Lab

○ Relatively free from the time–accuracy trade-off

○ BoltzGen, which designs binders based on Boltz-2, was recently announced



Figure 1. Boltz-2 Benchmarking Study


○ BoltzLab



Figure 2. Boltz-2 User review


Type 15. DrugCLIP: Because screening that depends on structure prediction is slow, DrugCLIP first embeds drugs and pockets in a shared space and then screens by search-engine-style retrieval.

Type 16. BindCLIP

Type 17. Chai, Chai2

Type 18. xQuest

Type 19. BindCraft : protein generation

Type 20. JAM-2 : protein generation

Type 21. Latent-X2 : protein generation

Type 22. IsoDDE : protein generation

Type 23. Bepler: A bidirectional LSTM model trained with contrastive learning on global and local protein structural similarity. It converts protein sequences into vectors that capture structural meaning.



4. Prediction of Protein Variant Function

⑴ PolyPhen-2 (Adzhubei et al., 2013)

⑵ SIFT (Kumar et al., 2009)

⑶ MutationTaster (Schwarz et al., 2014)

⑷ MutationAssessor (Reva et al., 2011)

⑸ LR and LRT (Chun & Fay, 2009)



5. Metabolomics Analysis

⑴ Peak detection: scipy.signal.find_peaks, peak widths

⑵ LC-MS/GC-MS peak picking, alignment, feature grouping: XCMS centWave

⑶ Normalization, scaling: Median, Quantile (Bolstad 2003), TIC, PQN (Dieterle 2006), Log2
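
As an illustration of PQN (probabilistic quotient normalization), the sketch below estimates a per-sample dilution factor as the median ratio of each feature to a reference profile and divides the sample by it. It assumes the reference is the feature-wise median across samples and that intensities are positive; any preceding total-area normalization is omitted. The function name is illustrative.

```python
from statistics import median


def pqn_normalize(samples):
    """Probabilistic quotient normalization.

    `samples` is a list of rows (one row per sample, one column per feature).
    Returns a new list of rows scaled by each sample's median quotient.
    """
    n_features = len(samples[0])
    # Reference profile: median intensity of each feature across samples
    reference = [median(row[j] for row in samples) for j in range(n_features)]
    normalized = []
    for row in samples:
        # Quotient of each feature against the reference (skip zero references)
        quotients = [x / ref for x, ref in zip(row, reference) if ref > 0]
        dilution = median(quotients)
        normalized.append([x / dilution for x in row])
    return normalized


if __name__ == "__main__":
    # Toy data: the same profile measured at 1x, 2x, and 0.5x dilution
    data = [[10, 20, 30], [20, 40, 60], [5, 10, 15]]
    print(pqn_normalize(data))
```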

⑷ Metabolite annotation: HMDB m/z matching, [M+H]⁺/[M-H]⁻/[M+Na]⁺ adducts
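
The m/z matching step can be sketched as follows: shift each candidate's neutral monoisotopic mass by the adduct mass and compare with the observed m/z within a ppm tolerance. The candidate table here is a one-entry stand-in for a real database export, the 10 ppm tolerance is an assumption, and only singly charged adducts are handled. Function names are hypothetical.

```python
# Mass shifts (Da) for singly charged adducts, electron mass accounted for
ADDUCT_SHIFTS = {
    "[M+H]+": 1.007276,
    "[M-H]-": -1.007276,
    "[M+Na]+": 22.989218,
}


def adduct_mz(neutral_mass, adduct):
    """Theoretical m/z of a singly charged adduct of a neutral molecule."""
    return neutral_mass + ADDUCT_SHIFTS[adduct]


def annotate(observed_mz, candidates, ppm_tolerance=10.0):
    """Return (name, adduct, error_ppm) for every candidate within tolerance."""
    hits = []
    for name, mass in candidates.items():
        for adduct in ADDUCT_SHIFTS:
            theoretical = adduct_mz(mass, adduct)
            error_ppm = (observed_mz - theoretical) / theoretical * 1e6
            if abs(error_ppm) <= ppm_tolerance:
                hits.append((name, adduct, round(error_ppm, 2)))
    return hits


if __name__ == "__main__":
    # Illustrative candidate table: name -> neutral monoisotopic mass
    candidates = {"glucose": 180.06339}
    print(annotate(181.0707, candidates))
```

Matching on m/z alone is ambiguous for isomers, so hits from this kind of lookup are usually treated as putative annotations.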

⑸ Feature quantification, imputation, normalization: Min/2, median, KNN imputation (sklearn); TIC/median/log norm

⑹ Univariate statistical testing with FDR correction: Welch’s t-test, Wilcoxon, ANOVA, Kruskal-Wallis + BH FDR
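
The BH (Benjamini-Hochberg) FDR correction applied after the univariate tests can be written in a few lines. The sketch below takes the raw p-values, one per metabolite feature, and returns adjusted p-values in the original order; the function name is illustrative.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]
    print(benjamini_hochberg(raw))
```

Features whose adjusted p-value falls below the chosen FDR level (for example 0.05) are reported as significant.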

⑺ Differential metabolite analysis with PCA: Welch’s t-test + BH FDR, PCA visualization

⑻ Pathway enrichment: hypergeometric test (ORA), KEGG pathways, BH FDR
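
Over-representation analysis (ORA) with the hypergeometric test asks how likely it is to see at least the observed overlap between the significant metabolites and a pathway by chance. A minimal sketch using exact integer arithmetic is shown below; the argument names are illustrative, and the resulting p-values across pathways would still need BH FDR correction.

```python
from math import comb


def ora_p_value(n_universe, n_pathway, n_hits, n_overlap):
    """Upper-tail hypergeometric p-value, P(X >= n_overlap).

    n_universe: all measured metabolites (background)
    n_pathway:  background metabolites that belong to the pathway
    n_hits:     significant metabolites
    n_overlap:  significant metabolites that belong to the pathway
    """
    upper = min(n_pathway, n_hits)
    total = comb(n_universe, n_hits)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_hits - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Toy numbers: 200 measured metabolites, 10 in the pathway,
    # 20 significant, 5 of which are in the pathway
    print(ora_p_value(200, 10, 20, 5))
```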



6. Database Utilization

⑴ Small Molecule Database

① Integrated Small Molecule Database: Provides physiological activity data for about 800,000 small molecules in vector format

② AlphaFold2 Database: Provides predicted structures for about 200 million proteins

③ Ensembl: Genome and transcriptome annotation database

④ UniProt: Protein database

⑤ The Human Protein Atlas: Public resource that aims to map all human proteins in cells, tissues, and organs

⑥ SGC (Chemical Probes): Provides a unique probe collection along with related data, control compounds, and usage recommendations

⑵ Antigen-antibody Database

① IEDB (Immune Epitope Database)

② VDJdb

③ BciPep

④ SAbDab

⑤ IMGT/3Dstructure-DB

⑥ AACDB (Antigen-Antibody Complex DB)

⑦ Thera-SAbDab

⑧ CoV-AbDab

⑨ Abcam: Commercial

⑩ CST: Commercial

⑪ CiteAb: Commercial

⑫ Antibodypedia: Commercial

⑬ ABCD database

⑶ Pharmacogenomics Database

① NCBI dbSNP

② gnomAD

③ PharmVar

④ PharmGKB

⑤ NCBI PubChem

⑥ Broad Institute CMAP

⑦ CTD

⑧ CompTox

⑨ DrugBank

⑩ STITCH (Search Tool for Interactions of Chemicals)

⑪ ToppFun

⑫ DepMap: Provides expression data and lineage information for each cancer cell line

⑬ L1000CDS2

⑭ L1000FWD

⑮ GDSC (Genomics of Drug Sensitivity in Cancer)

⑯ CCLE

⑰ ClinicalTrials.gov: Provides information on clinical trial progress for each drug

⑱ Cortellis: Provides information on the clinical trial progress of each drug.

⑲ The Antibody Society: Provides information on the clinical trial progress of antibodies.

⑳ PRISM: Provides large-scale drug response data across hundreds of cancer cell lines.



Input: 2024.03.31 01:08

Modified: 2024.09.29 15:40
