Platform methodology

Methodology

How Target Pathogen integrates genome annotation, structural evidence, druggability, off-target analysis, essentiality, ligands, and curated datasets to support bacterial target prioritization.

Integrated functionality

What the platform lets you do

A single workspace to move from a bacterial genome to a defensible shortlist of target candidates with traceable evidence.

Genome Explore a complete proteome

Load public or manually curated genomes, browse every protein, search by locus or annotation, and move between the genome table, protein detail, and structural viewer without losing context.

Filter Prioritize target candidates

Combine essentiality, human and microbiome off-target evidence, localization, functional annotation, structural confidence, druggability, and ligand evidence to focus on proteins worth reviewing.

Score Build custom scores

Create weighted scoring formulas from available evidence fields, compare ranked proteins, and reuse score definitions as the prioritization strategy evolves.
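
As an illustration of what a weighted scoring formula can look like, the sketch below ranks proteins by a normalized weighted sum of evidence fields. The field names, the weights, the normalization by total absolute weight, and the functions `weighted_score` and `rank_proteins` are all assumptions made for this example; they are not the platform's actual score engine.

```python
from typing import Dict, List, Mapping, Tuple


def weighted_score(evidence: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted sum of evidence fields, divided by the total absolute weight.

    Fields missing from `evidence` contribute 0 (an assumption of this sketch).
    Negative weights penalize a field, e.g. human off-target similarity.
    """
    total_weight = sum(abs(w) for w in weights.values())
    if total_weight == 0:
        raise ValueError("at least one non-zero weight is required")
    raw = sum(w * evidence.get(field, 0.0) for field, w in weights.items())
    return raw / total_weight


def rank_proteins(
    proteins: Mapping[str, Mapping[str, float]], weights: Mapping[str, float]
) -> List[Tuple[str, float]]:
    """Return (locus_tag, score) pairs sorted from highest to lowest score."""
    scored = [(locus, weighted_score(ev, weights)) for locus, ev in proteins.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical score definition and evidence values, for illustration only.
weights: Dict[str, float] = {"essentiality": 2.0, "druggability": 1.0, "human_offtarget": -1.0}
proteins = {
    "PA4406": {"essentiality": 1.0, "druggability": 0.8, "human_offtarget": 0.0},
    "VK055_0001": {"essentiality": 0.0, "druggability": 0.9, "human_offtarget": 1.0},
}
ranking = rank_proteins(proteins, weights)
```

Keeping the weights in a plain mapping makes a score definition easy to store and reuse as the prioritization strategy changes.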

3D Inspect structures and pockets

Review experimental PDB chains, AlphaFold DB models, ColabFold models, FPocket pockets, P2Rank pockets, residues, labels, pocket scores, and alpha-sphere geometry in the 3D viewer.

Ligands Connect proteins to chemical evidence

Surface co-crystallized PDB ligands, measured ChEMBL bioactivity, proposed ZINC compounds, LigQ_2 ligand evidence, and RDKit descriptors so structural evidence can be evaluated alongside compound-level information.

Curated Preserve expert-provided datasets

Manual imports are designed to prioritize curator-provided structures, pockets, annotations, and scores, with dry-run checks that report missing files before production data is overwritten.
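
A dry-run check of this kind can be sketched as a read-only comparison between the files a curated import expects and the files actually present. The function name `dry_run_missing` and the idea of a flat list of expected relative paths are assumptions for illustration; the platform's real import manifest is not described here.

```python
from pathlib import Path
from typing import Iterable, List


def dry_run_missing(import_dir: Path, expected: Iterable[str]) -> List[str]:
    """Return the expected relative paths that are absent from import_dir.

    The function only reads the filesystem; it never writes or deletes,
    so it is safe to run before any production data is touched.
    """
    return [rel for rel in expected if not (import_dir / rel).is_file()]
```

An import would proceed only when the returned list is empty; otherwise the missing paths are reported to the curator.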

Overview

Analysis pipeline

The pipeline runs 24 stages in sequence. Compute-heavy stages (★) run on dedicated compute nodes.

1–3 Genome import

GenBank download from NCBI (or manual upload). Proteins, sequences, and original annotations are imported into the database.

4–7 ★ Target scoring

FastTarget runs DIAMOND BLASTP against the human proteome, gut microbiome, and the Database of Essential Genes (DEG) to generate selectivity and essentiality scores.

8–11 ★ Protein annotation

Sequence indexing and InterProScan: domain annotation, GO terms, and EC numbers from 20 integrated databases.

12–19 ★ Structural evidence

UniProt mapping, experimental PDB retrieval, AlphaFold DB download, ColabFold predictions, FPocket + P2Rank pocket detection, druggability score loading.

20–21 Cell localization

PSORTb predicts subcellular localization using an SVM trained on experimentally characterized bacterial proteins.

22–24 ★ Ligand evidence

LigQ_2 is the internal Target Pathogen ligand-evidence step: it searches PDB, ChEMBL, and ZINC, then loads experimental, measured, homolog-transferred, and proposed compound records into the binders table.

Stages 2–3

Protein data

download_gbk → load_gbk

The GenBank file is downloaded from NCBI or uploaded manually. Each CDS feature is parsed and the following fields are imported:

Accession locus_tag qualifier (e.g. PA4406, VK055_0001)

Function product qualifier

Gene name gene qualifier (e.g. lpxC, envA)

Length Amino acid count from the translation qualifier

Status Annotated or hypothetical, inferred from the product description

The quality of functional annotations depends entirely on how the genome was originally annotated (PGAP/NCBI, Prokka, RAST, etc.). Target Pathogen does not modify or validate those annotations.
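
The field mapping above can be sketched as a small function that turns one CDS feature's qualifiers into a protein record. This is a minimal sketch, assuming the qualifiers have already been parsed into a plain dictionary of strings; the marker phrases used to infer the status and the names `infer_status` and `cds_to_record` are hypothetical, since the exact rule the loader applies is not documented here.

```python
from typing import Dict, Optional

# Hypothetical marker phrases; the platform's actual status rule may differ.
HYPOTHETICAL_MARKERS = ("hypothetical protein", "uncharacterized protein")


def infer_status(product: Optional[str]) -> str:
    """Classify a protein as 'annotated' or 'hypothetical' from its product text."""
    text = (product or "").strip().lower()
    if not text or any(marker in text for marker in HYPOTHETICAL_MARKERS):
        return "hypothetical"
    return "annotated"


def cds_to_record(qualifiers: Dict[str, str]) -> Dict[str, object]:
    """Map GenBank CDS qualifiers to the imported fields."""
    translation = qualifiers.get("translation", "")
    return {
        "accession": qualifiers.get("locus_tag"),
        "function": qualifiers.get("product"),
        "gene_name": qualifiers.get("gene"),
        "length": len(translation),  # amino acid count
        "status": infer_status(qualifiers.get("product")),
    }
```

Because the status is derived only from the product text, it inherits whatever the original annotation pipeline wrote there.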

Stages 13, 15, 16

3D structure

fetch_experimental_structures → alphafold_unips → colabfold_predict

Each protein is assigned a single preferred structure following a strict priority order:

Experimental structure Protein Data Bank

Crystallography, cryo-EM, or NMR entries retrieved by searching PDB for all structures linked to the protein's UniProt ID. Multiple chains and entries can be switched in the viewer.

AlphaFold Database model AlphaFold DB · EBI/DeepMind

Computational model downloaded using the protein's UniProt ID. Requires a successful UniProt mapping at stage 12.

ColabFold prediction ColabFold · AlphaFold2 + MMseqs2

Generated locally for proteins with no other structural coverage. Runs on CPU (~30–60 min/protein) or GPU via SLURM.
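
The priority order above amounts to picking the first available source from a fixed list. The sketch below shows that selection, assuming each candidate structure is a dictionary with a `source` key; the source labels and the function `preferred_structure` are illustrative names, not the platform's schema.

```python
from typing import Dict, Optional, Sequence

# Priority order taken from the text: experimental, then AlphaFold DB, then ColabFold.
PRIORITY = ("experimental", "alphafold_db", "colabfold")


def preferred_structure(available: Sequence[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Return the highest-priority structure, or None if the protein has none."""
    for source in PRIORITY:
        for structure in available:
            if structure["source"] == source:
                return structure
    return None
```

A ColabFold model is therefore selected only when neither an experimental entry nor an AlphaFold DB model exists for the protein.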

pLDDT — model confidence

The predicted Local Distance Difference Test (pLDDT) is a per-residue confidence score produced by AlphaFold and ColabFold. The platform stores and displays the mean pLDDT across all residues.

≥ 90 Very high confidence

70–89 Good confidence

50–69 Low confidence (possibly disordered)

< 50 Not reliable: avoid structural interpretation
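
The stored value and its confidence band can be reproduced with a few lines. This is a minimal sketch, assuming the per-residue pLDDT values are already available as a list of numbers; because a mean is rarely an integer, the sketch treats each band's lower bound as inclusive (so 89.6 still falls in the 70–89 band), which is an interpretation rather than documented behavior.

```python
from statistics import mean
from typing import Sequence


def mean_plddt(per_residue: Sequence[float]) -> float:
    """Average the per-residue pLDDT values of a model."""
    if not per_residue:
        raise ValueError("model has no residues")
    return mean(per_residue)


def confidence_band(score: float) -> str:
    """Map a mean pLDDT to the confidence bands listed above."""
    if score >= 90:
        return "Very high confidence"
    if score >= 70:
        return "Good confidence"
    if score >= 50:
        return "Low confidence"
    return "Not reliable"
```

Note that a single mean can hide poorly modeled regions: a protein with a confident core and a disordered tail may still average above 70.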

Stage 17

Binding sites

structures_remote — FPocket + P2Rank

Two complementary tools detect and rank putative binding pockets on each protein structure. Results are stored as residue sets and visualized in the 3D viewer.

FPocket v4+

Detects surface cavities using alpha spheres derived from a Voronoi tessellation. For each pocket, computes a druggability score based on volume, hydrophobic residue fraction, solvent accessibility, and dipole moment. The score stored per-protein is the maximum across all detected pockets.

≥ 0.7 Highly druggable

0.4–0.69 Moderately druggable

< 0.4 Low druggability
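
The per-protein FPocket value described above is a maximum over pockets followed by a threshold lookup. The sketch below assumes the pocket druggability scores have already been parsed into a list of floats; the function names are illustrative, and scores between 0.69 and 0.7 are assigned to the moderate band as an assumption.

```python
from typing import Optional, Sequence


def protein_druggability(pocket_scores: Sequence[float]) -> Optional[float]:
    """Return the maximum pocket druggability score, or None if no pocket was found."""
    return max(pocket_scores) if pocket_scores else None


def druggability_class(score: float) -> str:
    """Map a druggability score to the three bands listed above."""
    if score >= 0.7:
        return "Highly druggable"
    if score >= 0.4:
        return "Moderately druggable"
    return "Low druggability"
```

Taking the maximum means one well-formed pocket is enough to flag a protein, regardless of how many poor pockets it also has.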

P2Rank v2.x

Applies a random forest classifier trained on ~15,000 PDB structures. For each surface point, estimates the probability of belonging to a ligand-binding site using solvent accessibility, hydrophobicity, and evolutionary conservation as descriptors.

≥ 0.5 High probability

0.2–0.49 Medium probability

< 0.2 Low probability

Stages 10–11, 13

Functional annotation

interproscan → load_interpro → fetch_uniprot_annotations

Domains — InterProScan v5

Scans each protein sequence against 20 integrated databases and reports domain hits with residue coordinates. GO terms and EC numbers are also extracted when present in the output.

Pfam Conserved protein domains (most widely used)

HAMAP Prokaryotic families with characterized function

NCBIfam NCBI families, includes legacy TIGRfam HMMs

PANTHER Families and subfamilies with evolutionary function

Gene3D Structural domains based on CATH classification

SUPERFAMILY Domains based on SCOP classification

PIRSF Full-length families with specific function

SMART Signaling and regulatory domains

CDD NCBI Conserved Domain Database

TMHMM · Phobius Transmembrane domains and signal peptides

SignalP Signal peptides (Gram-negative, Gram-positive, Eukaryotic)

MobiDBLite Intrinsically disordered regions

Gene Ontology (GO) and EC numbers

GO terms and EC numbers are collected from two sources and merged: the InterProScan output (stage 11) and UniProt curated data (stage 13, which requires a successful UniProt mapping). If a protein has no UniProt mapping and InterProScan found no hits, an EC count of 0 (EC = 0) is the expected and correct result.
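
The merge can be pictured as a set union in which the UniProt side is simply absent when the mapping failed. This is a minimal sketch under that assumption; the function name `merge_annotations` and the choice to deduplicate and sort the result are illustrative, not a description of the loader.

```python
from typing import Iterable, List, Optional


def merge_annotations(
    interpro_terms: Iterable[str], uniprot_terms: Optional[Iterable[str]]
) -> List[str]:
    """Union of terms from InterProScan and UniProt, deduplicated and sorted.

    `uniprot_terms` is None when the protein has no UniProt mapping.
    """
    return sorted(set(interpro_terms) | set(uniprot_terms or ()))
```

With no InterProScan hits and no UniProt mapping, the merged list is empty, which is the EC = 0 case described above.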

Stages 4–7, 20–21

Target profile

fasttarget → psort

Four evidence signals are combined to support target prioritization. All are computationally derived — see interpretation notes at the bottom of this page.

Stage 24

Ligand evidence

ligq_remote — LigQ_2

LigQ_2 is an internal Target Pathogen Web pipeline step. It exports all protein sequences, transfers them to a compute node, and searches multiple databases for known or candidate binders. Non-relevant compounds (amino acids, water, crystallization agents, common ions) are filtered out at load time.

Evidence type	Source	Inclusion criterion	Limit
PDB co-crystal (direct)	PDB	The experimental structure contains a ligand and its UniProt record matches this protein	All
PDB via homologs	PDB	The ligand was observed in an experimental structure of a similar protein	All
ChEMBL bioactive (direct)	ChEMBL	Measured ChEMBL bioactivity is annotated to this same protein	Top 100 by pChEMBL
ChEMBL via homologs	ChEMBL	Measured ChEMBL bioactivity is annotated to a homologous protein	Top 100 by pChEMBL
ZINC proposed	ZINC	Candidate compound selected by chemical similarity to known binders; not measured activity	Top 50 by Tanimoto

pChEMBL Negative log of potency (IC50, Ki, Kd, etc.) in molar units. pChEMBL = 6 corresponds to IC50 = 1 µM. Higher = more potent.

Tanimoto Chemical similarity metric computed from molecular fingerprints. 1.0 = identical; 0.5 is the minimum threshold applied.

ZINC Public catalog of purchasable small molecules. In Target Pathogen, ZINC rows are proposed compounds inferred by similarity, not experimentally confirmed binders.

Direct vs homolog Direct means the evidence maps to this protein or its UniProt record. Homolog means the evidence comes from a similar protein and should be treated as weaker.
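
The two metrics defined above can be worked through in a few lines. This is a minimal sketch: pChEMBL is computed directly from a potency in molar units, and Tanimoto similarity is computed on fingerprints represented as sets of "on" bit indices. The set representation and the helper `top_similar` are assumptions for illustration; only the 0.5 threshold and the limit of 50 come from the text.

```python
import math
from typing import Dict, List, Set, Tuple


def pchembl(potency_molar: float) -> float:
    """Negative base-10 logarithm of a potency (IC50, Ki, Kd, ...) given in mol/L."""
    if potency_molar <= 0:
        raise ValueError("potency must be positive")
    return -math.log10(potency_molar)


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Shared bits divided by total distinct bits; 1.0 means identical fingerprints."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def top_similar(
    query: Set[int],
    candidates: Dict[str, Set[int]],
    threshold: float = 0.5,
    limit: int = 50,
) -> List[Tuple[str, float]]:
    """Keep candidates at or above the threshold, best first, capped at `limit`."""
    scored = [(name, tanimoto(query, bits)) for name, bits in candidates.items()]
    kept = [item for item in scored if item[1] >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)[:limit]
```

For example, an IC50 of 1 µM is 1e-6 mol/L, so `pchembl(1e-6)` gives 6, matching the definition above; a tenfold more potent compound (100 nM) gives 7.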

Computed from SMILES · RDKit

Drug-likeness

Calculated at page load. Predicted values, not experimental measurements.

MW Molecular weight (Da)

Average molecular weight accounting for natural isotopic abundances. Lipinski et al. propose ≤ 500 Da for oral permeability, though many approved antibiotics exceed this (β-lactams, glycopeptides).

LogP Lipophilicity

Crippen atomic contribution method. Estimates octanol/water partitioning. A range of 0–3 is generally favorable; values > 5 are associated with higher toxicity and metabolic liability; values < −2 indicate poor membrane permeability.

TPSA Topological polar surface area (Å²)

Sum of polar atom surface areas (O, N, and their hydrogens). A value ≤ 140 Å² meets the polar-surface-area part of Veber's oral bioavailability criteria (the other part is ≤ 10 rotatable bonds); < 60 Å² improves permeation across bacterial double membranes.

Ro5 Lipinski Rule of Five

Four conditions: MW ≤ 500 Da, LogP ≤ 5, H-bond donors ≤ 5, H-bond acceptors ≤ 10. Violations are counted (0 = compliant). Applies primarily to oral drugs — parenteral antibiotics routinely violate Ro5.

PAINS Pan-assay interference

RDKit PAINS filter (Baell & Holloway catalog). Flags substructures prone to false positives in HTS via promiscuous reactivity, optical interference, or non-specific binding. An alert does not disqualify a compound, but requires orthogonal validation.
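
The Ro5 violation count described above reduces to four comparisons. The sketch below assumes the descriptors have already been computed elsewhere (the platform uses RDKit for that) and only shows the counting step; the function name `ro5_violations` is illustrative.

```python
def ro5_violations(mw: float, logp: float, hbd: int, hba: int) -> int:
    """Count Lipinski Rule of Five violations (0 means compliant).

    mw   molecular weight in Da      (violation if > 500)
    logp predicted lipophilicity     (violation if > 5)
    hbd  hydrogen-bond donors        (violation if > 5)
    hba  hydrogen-bond acceptors     (violation if > 10)
    """
    return sum([mw > 500, logp > 5, hbd > 5, hba > 10])
```

A small, moderately lipophilic compound typically returns 0, while a large glycopeptide-like molecule with many donors and acceptors can return 3 and still be a useful parenteral antibiotic.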

Important

Interpretation notes

Most evidence is computational. The exceptions are PDB co-crystal binders (experimental co-crystallization) and direct ChEMBL entries (measured bioactivity); everything else is predicted, not confirmed.
Absence of a value is not absence of the property. EC = 0 can mean the protein has no characterized enzymatic activity, or that the pipeline found insufficient homology, or that the tool was not configured for that organism.
Druggability scores are structural descriptors, not predictors of biological activity or toxicity. A high score indicates a well-defined pocket with favorable physicochemical properties — it does not imply a known drug exists or that the pocket is the active site.
DEG essentiality is transferred: a protein may be essential in the DEG reference organism but not in the studied pathogen, and vice versa. Genome-specific essentiality screens are the gold standard.
AlphaFold and ColabFold models have known failure modes in intrinsically disordered regions, membrane proteins, and proteins with few homologs. Do not use model structures for pocket analysis in regions with pLDDT < 50.
Physicochemical properties are predicted from SMILES, not measured experimentally. Lipinski Ro5 is a statistical guide for oral drugs — not an absolute rule.