Platform methodology

Data sources

A complete reference for how each evidence signal is generated — which tool, which pipeline stage, which parameters, and what it means for target assessment.

Overview

Analysis pipeline

The pipeline runs 24 stages sequentially. Compute-heavy stages (★) run on dedicated compute nodes.

1–3 Genome import

GenBank download from NCBI (or manual upload). Proteins, sequences, and original annotations are imported into the database.

4–7 ★ Target scoring

FastTarget runs DIAMOND BLASTP against the human proteome, a gut microbiome reference set, and the Database of Essential Genes (DEG) to generate selectivity and essentiality scores.

8–11 ★ Protein annotation

Sequence indexing and InterProScan: domain annotation, GO terms, and EC numbers from 20 integrated databases.

12–19 ★ Structural evidence

UniProt mapping, experimental PDB retrieval, AlphaFold DB download, ColabFold predictions, FPocket + P2Rank pocket detection, druggability score loading.

20–21 Cell localization

PSORTb predicts subcellular localization using an SVM trained on experimentally characterized bacterial proteins.

22–24 ★ Ligand evidence

LigQ_2 searches PDB, ChEMBL, and ZINC for binders. Results are loaded into the binders table.
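The stage grouping above can be written down as a small lookup table. This is only a sketch: the stage ranges and the ★ flags are copied from the overview, while `STAGE_GROUPS` and `stage_group` are hypothetical names, not part of the platform.

```python
# Stage ranges and compute-node flags copied from the overview.
# The names STAGE_GROUPS and stage_group are illustrative only.
STAGE_GROUPS = [
    (1, 3, "Genome import", False),
    (4, 7, "Target scoring", True),
    (8, 11, "Protein annotation", True),
    (12, 19, "Structural evidence", True),
    (20, 21, "Cell localization", False),
    (22, 24, "Ligand evidence", True),
]


def stage_group(stage):
    """Return (group name, runs_on_compute_node) for a stage number in 1..24."""
    for first, last, name, compute_heavy in STAGE_GROUPS:
        if first <= stage <= last:
            return name, compute_heavy
    raise ValueError(f"stage {stage} is outside the 1..24 range")
```

For example, `stage_group(17)` places the pocket-detection stage in the structural evidence group, which is flagged as compute-heavy.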

Stages 2–3

Protein data

download_gbk → load_gbk

The GenBank file is downloaded from NCBI or uploaded manually. Each CDS feature is parsed and the following fields are imported:

Accession: locus_tag qualifier (e.g. PA4406, VK055_0001)
Function: product qualifier
Gene name: gene qualifier (e.g. lpxC, envA)
Length: amino acid count from the translation qualifier
Status: annotated or hypothetical, inferred from the product description

The quality of functional annotations depends entirely on how the genome was originally annotated (PGAP/NCBI, Prokka, RAST, etc.). Target Pathogen does not modify or validate those annotations.
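A minimal sketch of the field mapping described above, assuming the CDS qualifiers arrive as a dictionary of string lists (the shape Biopython exposes on a feature). The function name `cds_to_record` and the rule used to infer the status are assumptions made for illustration; the platform's actual rule is not documented here.

```python
def cds_to_record(qualifiers):
    """Map GenBank CDS qualifiers (name -> list of values) to the imported fields."""

    def first(key):
        # Qualifier values are lists; take the first one or fall back to "".
        values = qualifiers.get(key) or [""]
        return values[0]

    product = first("product")
    translation = first("translation")
    # Assumption: an empty product, or one containing "hypothetical",
    # marks the protein as hypothetical. The real rule may differ.
    is_hypothetical = not product or "hypothetical" in product.lower()
    return {
        "accession": first("locus_tag"),
        "function": product,
        "gene_name": first("gene"),
        "length": len(translation),
        "status": "hypothetical" if is_hypothetical else "annotated",
    }
```

A feature with no gene qualifier simply yields an empty gene name, which mirrors how sparsely annotated genomes carry through to the database.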

Stages 13, 15, 16

3D structure

fetch_experimental_structures → alphafold_unips → colabfold_predict

Each protein is assigned a single preferred structure following a strict priority order:

1. Experimental structure (Protein Data Bank)

Crystallography, cryo-EM, or NMR entries retrieved by searching PDB for all structures linked to the protein's UniProt ID. Multiple chains and entries can be switched in the viewer.

2. AlphaFold Database model (AlphaFold DB · EBI/DeepMind)

Computational model downloaded using the protein's UniProt ID. Requires a successful UniProt mapping at stage 12.

3. ColabFold prediction (ColabFold · AlphaFold2 + MMseqs2)

Generated locally for proteins with no other structural coverage. Runs on CPU (~30–60 min/protein) or GPU via SLURM.
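The priority order above amounts to a first-match lookup. The sketch below assumes each protein carries a dictionary of the structure sources found for it; the source keys and the function name are hypothetical.

```python
# Priority order from the list above; the key names are illustrative.
PRIORITY = ("experimental", "alphafold_db", "colabfold")


def preferred_structure(available):
    """Return (source, identifier) for the highest-priority source present.

    `available` maps a source key to a structure identifier or file path.
    Returns None when the protein has no structural coverage at all.
    """
    for source in PRIORITY:
        identifier = available.get(source)
        if identifier:
            return source, identifier
    return None
```

With this ordering, a protein that has both an AlphaFold DB model and a ColabFold prediction is shown with the AlphaFold DB model, and a ColabFold model is only preferred when nothing else exists.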

pLDDT — model confidence

The predicted Local Distance Difference Test (pLDDT) is a per-residue confidence score produced by AlphaFold and ColabFold. The platform stores and displays the mean pLDDT across all residues.

≥ 90: Very high confidence
70–89: Good confidence
50–69: Low confidence (possibly disordered)
< 50: Not reliable; avoid structural interpretation
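As a rough illustration of how the stored value and its label relate, the sketch below averages per-residue pLDDT values and bins the mean with the thresholds listed above. It assumes the per-residue scores have already been extracted (AlphaFold-style model files usually carry them in the B-factor column); the function name is hypothetical, and treating means between 89 and 90 as "good" is an assumption, since the legend leaves that gap open.

```python
from statistics import fmean


def plddt_summary(per_residue):
    """Return (mean pLDDT, confidence label) for a list of per-residue scores."""
    mean = fmean(per_residue)
    if mean >= 90:
        label = "very high confidence"
    elif mean >= 70:
        # Assumption: values in the 89-90 gap of the legend count as "good".
        label = "good confidence"
    elif mean >= 50:
        label = "low confidence"
    else:
        label = "not reliable"
    return mean, label
```

Because only the mean is stored, a model can score well overall while still containing low-confidence loops, so per-residue values remain worth checking before interpreting a specific region.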

Stage 17

Binding sites

structures_remote — FPocket + P2Rank

Two complementary tools detect and rank putative binding pockets on each protein structure. Results are stored as residue sets and visualized in the 3D viewer.

FPocket v4+

Detects surface cavities using alpha spheres derived from a Voronoi tessellation. For each pocket, computes a druggability score based on volume, hydrophobic residue fraction, solvent accessibility, and dipole moment. The score stored per-protein is the maximum across all detected pockets.

≥ 0.7: Highly druggable
0.4–0.69: Moderately druggable
< 0.4: Low druggability

P2Rank v2.x

Applies a random forest classifier trained on ~15,000 PDB structures. For each surface point, estimates the probability of belonging to a ligand-binding site using solvent accessibility, hydrophobicity, and evolutionary conservation as descriptors.

≥ 0.5: High probability
0.2–0.49: Medium probability
< 0.2: Low probability
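The two score legends in this section, together with the per-protein maximum used for FPocket, can be sketched as follows. The thresholds come from the text; the function names are hypothetical, and the input is assumed to be a plain list of per-pocket scores already parsed from the tool output.

```python
def bin_score(score, high, medium, labels):
    """Map a score to one of three labels using two lower-bound thresholds."""
    if score >= high:
        return labels[0]
    if score >= medium:
        return labels[1]
    return labels[2]


def fpocket_protein_score(pocket_scores):
    """Per-protein druggability: the maximum over all detected pockets."""
    return max(pocket_scores) if pocket_scores else None


def fpocket_label(score):
    return bin_score(score, 0.7, 0.4,
                     ("highly druggable", "moderately druggable", "low druggability"))


def p2rank_label(probability):
    return bin_score(probability, 0.5, 0.2,
                     ("high probability", "medium probability", "low probability"))
```

Taking the maximum means a single well-formed pocket is enough to give the whole protein a high score, so the per-pocket residue sets in the viewer are the place to check where that pocket actually sits.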

Stages 10–11, 13

Functional annotation

interproscan → load_interpro → fetch_uniprot_annotations

Domains — InterProScan v5

Scans each protein sequence against 20 integrated databases and reports domain hits with residue coordinates. GO terms and EC numbers are also extracted when present in the output.

Pfam: Conserved protein domains (most widely used)
HAMAP: Prokaryotic families with characterized function
NCBIfam: NCBI families, includes legacy TIGRfam HMMs
PANTHER: Families and subfamilies with evolutionary function
Gene3D: Structural domains based on CATH classification
SUPERFAMILY: Domains based on SCOP classification
PIRSF: Full-length families with specific function
SMART: Signaling and regulatory domains
CDD: NCBI Conserved Domain Database
TMHMM · Phobius: Transmembrane domains and signal peptides
SignalP: Signal peptides (Gram-negative, Gram-positive, Eukaryotic)
MobiDBLite: Intrinsically disordered regions

Gene Ontology (GO) and EC numbers

GO terms and EC numbers are collected from two sources and merged: the InterProScan output (stage 11) and UniProt curated data (stage 13, which requires a successful UniProt mapping). If a protein has no UniProt mapping and InterProScan found no hits, EC = 0 (no EC numbers assigned) is the expected and correct result.
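The merge described above is essentially a set union in which the UniProt side may be missing. A minimal sketch, with a hypothetical function name and with term lists assumed to be plain strings:

```python
def merge_terms(interpro_terms, uniprot_terms=None):
    """Union of terms from InterProScan and UniProt, without duplicates.

    `uniprot_terms` is None when the protein has no UniProt mapping.
    """
    merged = set(interpro_terms)
    if uniprot_terms is not None:
        merged.update(uniprot_terms)
    return sorted(merged)
```

When both inputs are empty the result is an empty list, which corresponds to the EC = 0 case: the count is zero because neither source contributed anything, not because the protein was shown to lack enzymatic activity.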

Stages 4–7, 20–21

Target profile

fasttarget → psort

Four evidence signals are combined to support target prioritization. All are computationally derived — see interpretation notes at the bottom of this page.

Human off-target DIAMOND BLASTP · UniProt

Searches the complete human proteome. A hit is reported when e-value ≤ 1×10⁻⁵. Best-alignment identity and e-value are stored.

Desirable: no hit — reduces risk of cross-toxicity to the human host.

Gut microbiome DIAMOND BLASTP · curated gut genomes

Searches a curated reference set of human gut commensal genomes. A hit requires identity > 40% AND query coverage > 70%.

Desirable: no hit — a compound targeting a microbiome-conserved protein may disrupt beneficial gut flora.
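The human and gut microbiome hit criteria above can be expressed as two small predicates. This sketch assumes the alignment fields (percent identity, e-value, query start/end, query length) have already been parsed from DIAMOND tabular output; the function names are hypothetical, and computing query coverage from the alignment span is an assumption about how coverage is derived.

```python
def query_coverage(qstart, qend, qlen):
    """Percent of the query sequence covered by the alignment (1-based, inclusive)."""
    return 100.0 * (qend - qstart + 1) / qlen


def is_human_hit(evalue):
    """Human off-target rule: a hit is reported when e-value <= 1e-5."""
    return evalue <= 1e-5


def is_gut_hit(identity_pct, query_cov_pct):
    """Gut microbiome rule: identity > 40% AND query coverage > 70%."""
    return identity_pct > 40.0 and query_cov_pct > 70.0
```

Note that both gut thresholds are strict inequalities in the text, so an alignment at exactly 40% identity or exactly 70% coverage does not count as a hit under this reading.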

Essentiality DIAMOND BLASTP · DEG

Searches the Database of Essential Genes — genes confirmed essential by transposon mutagenesis, deletion, or fitness competition in model organisms.

Desirable: hit — proteins with DEG homologs are more likely essential to pathogen viability.

Subcellular localization PSORTb v3

SVM classifier trained on experimentally characterized bacterial proteins. Predicts: Cytoplasmic, CytoplasmicMembrane, Periplasmic, OuterMembrane, Extracellular, or Unknown.

Context: outer membrane and extracellular proteins are more accessible to compounds; in Gram-negative bacteria, cytoplasmic targets require compounds to cross both the outer and inner membranes.

Stage 24

Ligand evidence

ligq_remote — LigQ_2

LigQ_2 exports all protein sequences, transfers them to a compute node, and searches multiple databases for known or candidate binders. Non-relevant compounds (amino acids, water, crystallization agents, common ions) are filtered out at load time.

Evidence type | Source | Inclusion criterion | Limit
PDB co-crystal (direct) | PDB | UniProt from PDB record matches this protein | All
PDB via homologs | PDB | UniProt from PDB record is a similar protein | All
ChEMBL bioactive (direct) | ChEMBL | ChEMBL target UniProt matches this protein | Top 100 by pChEMBL
ChEMBL via homologs | ChEMBL | ChEMBL target is a homologous protein | Top 100 by pChEMBL
ZINC proposed | ZINC | Tanimoto chemical similarity ≥ 0.5 to known binders | Top 50 by Tanimoto

pChEMBL: Negative base-10 logarithm of potency (IC50, Ki, Kd, etc.) expressed in molar units. pChEMBL = 6 corresponds to IC50 = 1 µM. Higher = more potent.
Tanimoto: Chemical similarity metric computed from molecular fingerprints. 1.0 = identical; 0.5 is the minimum threshold applied.
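Both metrics are simple to compute by hand, which helps when sanity-checking values in the binders table. The sketch below is illustrative only: fingerprints are represented as sets of on-bit indices rather than RDKit objects, and the function names and the nM input unit are assumptions.

```python
import math


def pchembl(potency_nm):
    """pChEMBL from a potency in nanomolar.

    Equivalent to -log10(potency in mol/L), since 1 nM = 1e-9 mol/L.
    """
    return 9.0 - math.log10(potency_nm)


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = bits_a | bits_b
    if not union:
        return 0.0
    return len(bits_a & bits_b) / len(union)


def top_similar(similarities, threshold=0.5, limit=50):
    """Keep compounds with similarity >= threshold, best first, capped at `limit`."""
    kept = [(cid, s) for cid, s in similarities.items() if s >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]
```

An IC50 of 1000 nM (1 µM) gives a pChEMBL of 6, matching the definition above, and two fingerprints sharing two of four distinct bits sit exactly on the 0.5 inclusion threshold.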

Computed from SMILES · RDKit

Drug-likeness

Calculated at page load. Predicted values, not experimental measurements.

MW Molecular weight (Da)

Average molecular weight accounting for natural isotopic abundances. Lipinski et al. propose ≤ 500 Da for oral permeability, though many approved antibiotics exceed this (β-lactams, glycopeptides).

LogP Lipophilicity

Crippen atomic contribution method. Estimates octanol/water partitioning. A range of 0–3 is generally favorable; > 5 is associated with toxicity and metabolic liabilities; < −2 suggests poor membrane permeability.

TPSA Topological polar surface area (Ų)

Sum of polar atom surface areas (O, N, and their hydrogens). < 140 Ų satisfies Veber's bioavailability criterion; < 60 Ų improves permeation across bacterial double membranes.

Ro5 Lipinski Rule of Five

Four conditions: MW ≤ 500 Da, LogP ≤ 5, H-bond donors ≤ 5, H-bond acceptors ≤ 10. Violations are counted (0 = compliant). Applies primarily to oral drugs — parenteral antibiotics routinely violate Ro5.
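Counting violations is a direct translation of the four conditions. The sketch below takes the descriptor values as plain numbers; in practice they would come from RDKit (for example `Descriptors.MolWt`, `Crippen.MolLogP`, `Lipinski.NumHDonors`, `Lipinski.NumHAcceptors`), but which exact RDKit calls the platform uses is an assumption, as is the function name.

```python
def ro5_violations(mw, logp, h_donors, h_acceptors):
    """Number of Lipinski Rule of Five conditions violated (0 = compliant)."""
    checks = [
        mw > 500,          # molecular weight above 500 Da
        logp > 5,          # lipophilicity above 5
        h_donors > 5,      # more than 5 H-bond donors
        h_acceptors > 10,  # more than 10 H-bond acceptors
    ]
    return sum(checks)
```

A small, moderately lipophilic compound returns 0, while a large polar molecule with many donors and acceptors can return 3 even with an acceptable LogP, which is the pattern typical of parenteral antibiotics.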

PAINS Pan-assay interference

RDKit PAINS filter (Baell & Holloway catalog). Flags substructures prone to false positives in HTS via promiscuous reactivity, optical interference, or non-specific binding. An alert does not disqualify a compound, but requires orthogonal validation.

Important

Interpretation notes

  • All evidence is computational — except PDB co-crystal binders (experimental co-crystallization) and direct ChEMBL entries (measured bioactivity). Everything else is predicted, not confirmed.
  • Absence of a value is not absence of the property. EC = 0 can mean the protein has no characterized enzymatic activity, or that the pipeline found insufficient homology, or that the tool was not configured for that organism.
  • Druggability scores are structural descriptors, not predictors of biological activity or toxicity. A high score indicates a well-defined pocket with favorable physicochemical properties — it does not imply a known drug exists or that the pocket is the active site.
  • DEG essentiality is transferred: a protein may be essential in the DEG reference organism but not in the studied pathogen, and vice versa. Genome-specific essentiality screens are the gold standard.
  • AlphaFold and ColabFold models have known failure modes in intrinsically disordered regions, membrane proteins, and proteins with few homologs. Do not use model structures for pocket analysis in regions with pLDDT < 50.
  • Physicochemical properties are predicted from SMILES, not measured experimentally. Lipinski Ro5 is a statistical guide for oral drugs — not an absolute rule.