Platform methodology
Data sources
A complete reference for how each evidence signal is generated — which tool, which pipeline stage, which parameters, and what it means for target assessment.
Overview
Analysis pipeline
The pipeline runs 24 stages in sequence. Compute-heavy stages (marked ★) run on dedicated compute nodes.
GenBank download from NCBI (or manual upload). Proteins, sequences, and original annotations are imported into the database.
FastTarget runs DIAMOND BLASTP against the human proteome, gut microbiome, and the Database of Essential Genes (DEG) to generate selectivity and essentiality scores.
Sequence indexing and InterProScan: domain annotation, GO terms, and EC numbers from 20 integrated databases.
UniProt mapping, experimental PDB retrieval, AlphaFold DB download, ColabFold predictions, FPocket + P2Rank pocket detection, druggability score loading.
PSORTb predicts subcellular localization using an SVM trained on experimentally characterized bacterial proteins.
LigQ_2 searches PDB, ChEMBL, and ZINC for binders. Results are loaded into the binders table.
Stages 2–3
Protein data
download_gbk → load_gbk
The GenBank file is downloaded from NCBI or uploaded manually. Each CDS feature is parsed and the following fields are imported:
Stages 13, 15, 16
3D structure
fetch_experimental_structures → alphafold_unips → colabfold_predict
Each protein is assigned a single preferred structure following a strict priority order:
1. Experimental structure (PDB): crystallography, cryo-EM, or NMR entries retrieved by searching PDB for all structures linked to the protein's UniProt ID. Multiple chains and entries can be switched in the viewer.
2. AlphaFold DB: computational model downloaded using the protein's UniProt ID. Requires a successful UniProt mapping at stage 12.
3. ColabFold: model generated locally for proteins with no other structural coverage. Runs on CPU (~30–60 min/protein) or GPU via SLURM.
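The priority order above amounts to a first-match selection over the available sources. The sketch below is only an illustration of that logic, not the platform's code: `StructureCandidate`, the source labels in `PRIORITY`, and `pick_preferred_structure` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical source labels, ordered from most to least preferred.
PRIORITY = ("experimental_pdb", "alphafold_db", "colabfold")


@dataclass(frozen=True)
class StructureCandidate:
    source: str      # one of the labels in PRIORITY
    identifier: str  # e.g. a PDB ID or a model file name


def pick_preferred_structure(
    candidates: Sequence[StructureCandidate],
) -> Optional[StructureCandidate]:
    """Return the first candidate from the highest-priority source, or None."""
    for source in PRIORITY:
        for candidate in candidates:
            if candidate.source == source:
                return candidate
    return None
```

A protein with both an AlphaFold DB model and a ColabFold prediction would resolve to the AlphaFold DB model; a protein with no candidates returns `None`.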
pLDDT — model confidence
The predicted Local Distance Difference Test (pLDDT) is a per-residue confidence score (0–100) produced by AlphaFold and ColabFold. The platform stores and displays the mean pLDDT across all residues.
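AlphaFold and ColabFold write the per-residue pLDDT into the B-factor column of their PDB output, so a mean can be recovered from the model file itself. The following is a minimal sketch under two assumptions: the model is in fixed-width PDB format, and each residue has exactly one CA atom. The function name `mean_plddt_from_pdb` is hypothetical and the platform may compute the value differently.

```python
from statistics import mean


def mean_plddt_from_pdb(pdb_text: str) -> float:
    """Mean pLDDT over residues, read from the B-factor column of CA atoms."""
    scores = []
    for line in pdb_text.splitlines():
        # PDB fixed-width layout: atom name in columns 13-16,
        # B-factor (pLDDT in predicted models) in columns 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError("no CA atoms found in PDB text")
    return mean(scores)
```

Reading only CA atoms counts each residue once, which keeps large side chains from weighting the mean.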
Stage 17
Binding sites
structures_remote — FPocket + P2Rank
Two complementary tools detect and rank putative binding pockets on each protein structure. Results are stored as residue sets and visualized in the 3D viewer.
FPocket: detects surface cavities using alpha spheres derived from a Voronoi tessellation. For each pocket, it computes a druggability score based on volume, hydrophobic residue fraction, solvent accessibility, and dipole moment. The score stored per protein is the maximum across all detected pockets.
P2Rank: applies a random forest classifier trained on ~15,000 PDB structures. For each surface point, it estimates the probability of belonging to a ligand-binding site using solvent accessibility, hydrophobicity, and evolutionary conservation as descriptors.
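The per-protein druggability value described above is a maximum over pocket scores. A minimal sketch follows; returning `None` when no pocket was detected is an assumption, and `protein_druggability` is a hypothetical name.

```python
from typing import Iterable, Optional


def protein_druggability(pocket_scores: Iterable[float]) -> Optional[float]:
    """Highest pocket druggability score, or None if no pocket was detected."""
    scores = list(pocket_scores)
    return max(scores) if scores else None
```

Taking the maximum means a single well-formed pocket is enough to give the protein a high score, regardless of how many poor pockets surround it.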
Stages 10–11, 13
Functional annotation
interproscan → load_interpro → fetch_uniprot_annotations
Domains — InterProScan v5
Scans each protein sequence against 20 integrated databases and reports domain hits with residue coordinates. GO terms and EC numbers are also extracted when present in the output.
Gene Ontology (GO) and EC numbers
GO terms and EC numbers are collected from two sources and merged: the InterProScan output (stage 11) and UniProt curated data (stage 13, which requires a successful UniProt mapping). If a protein has no UniProt mapping and InterProScan found no hits, the displayed EC = 0 (no EC numbers assigned) is the expected and correct result.
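The merge of the two annotation sources can be pictured as a set union in which the UniProt side may be absent. This is a sketch only: the `Annotations` type and `merge_annotations` are hypothetical, and the identifiers in the usage note are illustrative values rather than pipeline output.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Annotations:
    go_terms: Set[str] = field(default_factory=set)
    ec_numbers: Set[str] = field(default_factory=set)


def merge_annotations(
    interpro: Annotations, uniprot: Optional[Annotations]
) -> Annotations:
    """Union of both sources; uniprot is None when the UniProt mapping failed."""
    if uniprot is None:
        uniprot = Annotations()
    return Annotations(
        go_terms=interpro.go_terms | uniprot.go_terms,
        ec_numbers=interpro.ec_numbers | uniprot.ec_numbers,
    )
```

For example, merging an InterProScan result carrying `GO:0005524` with a UniProt record carrying `GO:0005524` and EC `2.7.11.1` yields one GO term and one EC number. Merging an empty InterProScan result with `None` yields zero EC numbers, which is the EC = 0 case described above.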
Stages 4–7, 20–21
Target profile
fasttarget → psort
Four evidence signals are combined to support target prioritization. All are computationally derived — see interpretation notes at the bottom of this page.
Human proteome: searches the complete human proteome. A hit is reported when e-value ≤ 1×10⁻⁵. Best-alignment identity and e-value are stored.
Desirable: no hit, which reduces the risk of cross-toxicity to the human host.
Gut microbiome: searches a curated reference set of human gut commensal genomes. A hit requires identity > 40% AND query coverage > 70%.
Desirable: no hit, because a compound targeting a microbiome-conserved protein may disrupt beneficial gut flora.
Essentiality (DEG): searches the Database of Essential Genes, which lists genes confirmed essential by transposon mutagenesis, deletion, or fitness competition in model organisms.
Desirable: hit, because proteins with DEG homologs are more likely essential to pathogen viability.
Subcellular localization (PSORTb): SVM classifier trained on experimentally characterized bacterial proteins. Predicts: Cytoplasmic, CytoplasmicMembrane, Periplasmic, OuterMembrane, Extracellular, or Unknown.
Context: outer membrane and extracellular proteins are more accessible; in Gram-negative bacteria, cytoplasmic targets require compounds to cross both membranes.
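The two hit criteria with stated thresholds (human proteome and gut microbiome) can be written as simple predicates over the best alignment. This sketch assumes identity and coverage are expressed as percentages; the `Alignment` type and function names are hypothetical. The DEG hit criterion is not stated on this page, so it is left out.

```python
from dataclasses import dataclass

HUMAN_EVALUE_MAX = 1e-5         # hit when e-value <= 1e-5
MICROBIOME_IDENTITY_MIN = 40.0  # hit requires identity > 40 %
MICROBIOME_COVERAGE_MIN = 70.0  # ... AND query coverage > 70 %


@dataclass(frozen=True)
class Alignment:
    identity: float        # percent identity of the best alignment
    query_coverage: float  # percent of the query covered
    evalue: float


def is_human_hit(aln: Alignment) -> bool:
    """Human off-target hit: e-value at or below the threshold."""
    return aln.evalue <= HUMAN_EVALUE_MAX


def is_microbiome_hit(aln: Alignment) -> bool:
    """Microbiome hit: both thresholds must be strictly exceeded."""
    return (
        aln.identity > MICROBIOME_IDENTITY_MIN
        and aln.query_coverage > MICROBIOME_COVERAGE_MIN
    )
```

Note the boundary behaviour: an e-value of exactly 1×10⁻⁵ counts as a human hit, while an identity of exactly 40% does not count as a microbiome hit.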
Stage 24
Ligand evidence
ligq_remote — LigQ_2
LigQ_2 exports all protein sequences, transfers them to a compute node, and searches multiple databases for known or candidate binders. Non-relevant compounds (amino acids, water, crystallization agents, common ions) are filtered out at load time.
| Evidence type | Source | Inclusion criterion | Limit |
|---|---|---|---|
| PDB co-crystal (direct) | PDB | UniProt from PDB record matches this protein | All |
| PDB via homologs | PDB | UniProt from PDB record is a similar protein | All |
| ChEMBL bioactive (direct) | ChEMBL | ChEMBL target UniProt matches this protein | Top 100 by pChEMBL |
| ChEMBL via homologs | ChEMBL | ChEMBL target is a homologous protein | Top 100 by pChEMBL |
| ZINC proposed | ZINC | Tanimoto chemical similarity ≥ 0.5 to known binders | Top 50 by Tanimoto |
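The ZINC row of the table combines a similarity threshold with a top-N cut. The sketch below shows one way this could work, under stated assumptions: fingerprints are represented as sets of on-bit indices, and a candidate is scored by its best similarity to any known binder (the page does not say how multiple binders are aggregated). The names `tanimoto` and `select_zinc_candidates` are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def select_zinc_candidates(
    candidates: Iterable[Tuple[str, FrozenSet[int]]],
    known_binders: Iterable[FrozenSet[int]],
    threshold: float = 0.5,
    limit: int = 50,
) -> List[Tuple[str, float]]:
    """Keep candidates with best similarity >= threshold, top `limit` by score."""
    binders = list(known_binders)
    scored = []
    for compound_id, fingerprint in candidates:
        # Assumption: score each candidate by its closest known binder.
        best = max((tanimoto(fingerprint, b) for b in binders), default=0.0)
        if best >= threshold:
            scored.append((compound_id, best))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]
```

The ChEMBL rows follow the same pattern with a different sort key: rank by pChEMBL and keep the top 100.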
Computed from SMILES · RDKit
Drug-likeness
Calculated at page load. Predicted values, not experimental measurements.
Molecular weight (MW): average molecular weight accounting for natural isotopic abundances. Lipinski et al. propose ≤ 500 Da for oral permeability, though many approved antibiotics exceed this (β-lactams, glycopeptides).
LogP: Crippen atomic contribution method. Estimates octanol/water partitioning. Range 0–3 is generally favorable; > 5 suggests a risk of toxicity and metabolic issues; < −2 indicates poor membrane permeability.
TPSA: sum of polar atom surface areas (O, N, and their hydrogens). < 140 Ų satisfies Veber's bioavailability criterion; < 60 Ų improves permeation across bacterial double membranes.
Lipinski Rule of Five (Ro5): four conditions: MW ≤ 500 Da, LogP ≤ 5, H-bond donors ≤ 5, H-bond acceptors ≤ 10. Violations are counted (0 = compliant). Applies primarily to oral drugs; parenteral antibiotics routinely violate Ro5.
PAINS alerts: RDKit PAINS filter (Baell & Holloway catalog). Flags substructures prone to false positives in HTS via promiscuous reactivity, optical interference, or non-specific binding. An alert does not disqualify a compound, but it does require orthogonal validation.
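The Ro5 violation count and the TPSA check can be expressed directly from the thresholds above. To keep the sketch free of dependencies, it takes precomputed descriptor values as input rather than calling RDKit; the `Descriptors` type and function names are hypothetical, and the numbers in the usage note are made-up illustrative values, not measurements of any real compound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    mol_weight: float  # Da
    logp: float
    h_donors: int
    h_acceptors: int
    tpsa: float        # square angstroms


def lipinski_violations(d: Descriptors) -> int:
    """Number of Ro5 conditions violated (0 = compliant)."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ])


def meets_veber_tpsa(d: Descriptors) -> bool:
    """TPSA part of Veber's criterion, using the < 140 threshold stated above."""
    return d.tpsa < 140
```

A hypothetical compound with MW 650, LogP 2.1, 6 donors, 9 acceptors, and TPSA 150 has two Ro5 violations (MW and donors) and fails the TPSA check. As noted above, that profile is common among parenteral antibiotics and is not a reason to discard a compound on its own.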
Important
Interpretation notes
- All evidence is computational, with two exceptions: PDB co-crystal binders (experimental co-crystallization) and direct ChEMBL entries (measured bioactivity). Everything else is predicted, not confirmed.
- Absence of a value is not absence of the property. EC = 0 can mean the protein has no characterized enzymatic activity, or that the pipeline found insufficient homology, or that the tool was not configured for that organism.
- Druggability scores are structural descriptors, not predictors of biological activity or toxicity. A high score indicates a well-defined pocket with favorable physicochemical properties — it does not imply a known drug exists or that the pocket is the active site.
- DEG essentiality is transferred: a protein may be essential in the DEG reference organism but not in the studied pathogen, and vice versa. Genome-specific essentiality screens are the gold standard.
- AlphaFold and ColabFold models have known failure modes in intrinsically disordered regions, membrane proteins, and proteins with few homologs. Do not use model structures for pocket analysis in regions with pLDDT < 50.
- Physicochemical properties are predicted from SMILES, not measured experimentally. Lipinski Ro5 is a statistical guide for oral drugs — not an absolute rule.