Structure-Brain Link (SBL) · Computational Drug Discovery

Structure-Brain Link (SBL) Pipeline

BBB Permeability · Molecular Descriptors · ML · Mechanistic PK · PBPK Simulation
STAGE 0 · Input
Query Compound
CN1CCC[C@H]1c2cccnc2 ← SMILES string (e.g. Nicotine)
Format: SMILES string (CSV upload)
Training: BBBP + B3DB merged, 7,800+ compounds (deduplicated)
Source 1: Martins et al. 2012 · MoleculeNet · DeepChem (BBBP)
Source 2: Meng et al. 2021 · B3DB (theochem/B3DB) · Sci. Data 8, 289
Label: BBB+ (penetrant) / BBB− (non-penetrant)
RDKit mol object
STAGE 1 · Descriptor Calculation
⚗️
Class A · Physicochemical
  • MW, ExactMW — molecular size
  • LogP — lipophilicity (Crippen)
  • TPSA — polar surface area
  • HBD / HBA — H-bond donors/acceptors
  • QED — drug-likeness score
  • BBB Score — Gupta et al. 2019 (0–6)
  • CNS MPO — Pfizer Wager 2010 (0–6)
  • Derived: TPSA/MW, LogP/TPSA, MW/HeavyAtom
~14 features
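The derived ratio features listed above can be sketched as follows. This is a minimal illustration, not the pipeline's own code: the function name, the zero-denominator handling (returning None so that a later imputation step can fill the gap), and the input values are all assumptions.

```python
def derived_ratios(mw, logp, tpsa, heavy_atoms):
    """Return the three derived ratio features named in Class A.

    A zero denominator (e.g. TPSA = 0 for a pure hydrocarbon) yields None.
    This handling is an assumption made for the sketch.
    """
    def safe_div(num, den):
        return num / den if den else None

    return {
        "TPSA/MW": safe_div(tpsa, mw),
        "LogP/TPSA": safe_div(logp, tpsa),
        "MW/HeavyAtom": safe_div(mw, heavy_atoms),
    }


# Illustrative inputs only; these are not descriptor values from the pipeline.
ratios = derived_ratios(mw=200.0, logp=2.0, tpsa=40.0, heavy_atoms=16)
print(ratios)
```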
🕸️
Class B · Topological
  • Chi indices (0–3n, 0–1v) — connectivity
  • Kappa 1,2,3 — molecular shape
  • BertzCT, BalabanJ — complexity
  • Ring counts — aromatic, aliphatic, saturated
  • Amide bonds, Spiro, Bridgehead
  • Stereocenters, NHOHCount, NOCount
~25 features
Class C · Electrostatic
  • MaxPartialCharge, MinPartialCharge
  • MaxAbsCharge, MinAbsCharge
  • NumValenceElectrons
  • LabuteASA — accessible surface area
  • PEOE_VSA 1–3 — charge-weighted VSA
  • SMR_VSA, SlogP_VSA — polarisability VSA
~13 features
🔵
Class D · Binary Flags
  • Lipinski violations (0–4) + pass flag
  • Ionization class — basic / acidic / zwitterionic / neutral (one-hot)
  • Functional groups — amide, amine, alcohol, carboxyl, sulfone, halogen, aromatic N
~12 features
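As an illustration of the Lipinski flags in Class D, the sketch below counts violations of the four standard rule-of-five criteria (MW > 500, LogP > 5, HBD > 5, HBA > 10). The pass criterion of at most one violation is the conventional reading and is an assumption here, as are the function names and example inputs.

```python
def lipinski_violations(mw, logp, hbd, hba):
    """Count rule-of-five violations (0 to 4)."""
    checks = [mw > 500, logp > 5, hbd > 5, hba > 10]
    return sum(checks)


def lipinski_pass(violations, max_allowed=1):
    """Pass flag; allowing one violation is an assumption of this sketch."""
    return violations <= max_allowed


# Illustrative inputs only: a small, drug-like molecule and a large, polar one.
small = lipinski_violations(mw=162.2, logp=1.2, hbd=0, hba=2)
large = lipinski_violations(mw=650.0, logp=6.1, hbd=6, hba=12)
print(small, lipinski_pass(small))
print(large, lipinski_pass(large))
```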
64 raw features → feature selection
STAGE 2 · Feature Selection
Step 1 · Variance Filter: remove near-zero-variance features (threshold = 0.01)
Step 2 · Correlation Filter: remove highly collinear pairs (|r| > 0.90 cut-off)
Step 3 · Mutual Information: rank features by MI with the BBB label (top 40, MI > 0.005)
Step 4 · VIF Check: Variance Inflation Factor audit (multicollinearity flag)
Result
~35–40 features
Imputed (median) + StandardScaled for LR
selected feature matrix X
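The first two filtering steps can be sketched in plain Python as below. This is a simplified stand-in under stated assumptions: the greedy "keep the first feature of a correlated pair" policy, the helper names, and the toy data are hypothetical, and the mutual-information ranking and VIF audit are omitted.

```python
from math import sqrt


def variance(xs):
    """Population variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either column is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def filter_features(columns, var_threshold=0.01, corr_cutoff=0.90):
    """columns maps feature name -> list of values; returns surviving names."""
    # Step 1: drop near-zero-variance features.
    kept = [name for name, vals in columns.items() if variance(vals) > var_threshold]
    # Step 2: greedily drop any feature too correlated with one already selected.
    selected = []
    for name in kept:
        if all(abs(pearson(columns[name], columns[s])) <= corr_cutoff for s in selected):
            selected.append(name)
    return selected


# Toy data for illustration only.
demo = {
    "TPSA": [20.0, 45.0, 90.0, 120.0, 60.0],
    "TPSA_copy": [21.0, 46.0, 91.0, 121.0, 61.0],  # perfectly correlated with TPSA
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0],         # zero variance
    "LogP": [3.1, 0.5, 2.2, 1.9, 3.8],
}
selected = filter_features(demo)
print(selected)
```

Which member of a correlated pair is retained is a design choice; keeping the first one encountered is only the simplest option.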
STAGE 3 · EDA & Statistics
📊
Mann-Whitney U Test
  • Two-sided test per descriptor
  • BBB+ vs BBB− group comparison
  • Significance: *** p<0.001, ** p<0.01, * p<0.05
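For orientation, the U statistic behind the test above can be computed by direct pair counting, and the star notation is a simple threshold mapping. This is a sketch only: it does not compute the p-value (in practice a statistics library such as SciPy's `mannwhitneyu` would supply it), and the "ns" label for non-significant results is an assumption not stated in the source.

```python
def mann_whitney_u(group_a, group_b):
    """U statistic for group_a: each pair with a > b counts 1, ties count 0.5."""
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def stars(p):
    """Map a p-value to the significance notation used above."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return "ns"  # assumed label for p >= 0.05


# Toy descriptor values for illustration only.
u_low = mann_whitney_u([1, 2, 3], [4, 5, 6])
u_high = mann_whitney_u([4, 5, 6], [1, 2, 3])
print(u_low, u_high, stars(0.0005))
```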
📈
Point-Biserial Correlation
  • r_pb between each descriptor and BBB label
  • Direction: promotes (+) vs inhibits (−) BBB+
  • TPSA is consistently top negative predictor
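The point-biserial coefficient can be written out directly as r_pb = (M1 − M0) / s × sqrt(n1·n0 / n²), where M1 and M0 are the descriptor means of the BBB+ and BBB− groups and s is the population standard deviation of the descriptor. The sketch below assumes 0/1 labels; the function name and toy values are hypothetical.

```python
from math import sqrt


def point_biserial(values, labels):
    """Point-biserial correlation between a continuous descriptor and 0/1 labels."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    n, n1, n0 = len(values), len(pos), len(neg)
    mean_all = sum(values) / n
    sd = sqrt(sum((v - mean_all) ** 2 for v in values) / n)  # population SD
    if sd == 0 or n1 == 0 or n0 == 0:
        return 0.0
    return (sum(pos) / n1 - sum(neg) / n0) / sd * sqrt(n1 * n0 / n ** 2)


# Toy data: lower TPSA in the BBB+ group gives a negative coefficient.
tpsa = [20.0, 35.0, 40.0, 95.0, 110.0, 130.0]
bbb = [1, 1, 1, 0, 0, 0]
r_pb = point_biserial(tpsa, bbb)
print(round(r_pb, 3))
```

A negative value marks a descriptor that inhibits BBB+, a positive value one that promotes it.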
🗺️
Correlation Heatmap
  • Pearson r matrix for top-15 features
  • Identifies collinear descriptor clusters
10-fold stratified CV
STAGE 4 · ML Model Training
6 Classifiers · 10-Fold Stratified Cross-Validation · Class-Balanced
XGBoost: AUC ~0.93 (scale_pos_weight · n=300)
Random Forest: AUC ~0.92 (balanced · sqrt features · n=300)
LightGBM: AUC ~0.92 (num_leaves=63 · n=300)
Extra Trees: AUC ~0.91 (balanced · n=300)
Gradient Boost: AUC ~0.90 (subsample=0.8 · n=200)
Logistic Reg.: AUC ~0.87 (balanced · StandardScaler)
Primary metric
AUC-ROC
Out-of-fold (OOF) predictions
Also tracked
MCC, F1, AUC-PR
Balanced accuracy, precision, recall
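Two of the secondary metrics follow directly from confusion-matrix counts. The sketch below uses the standard definitions of the Matthews correlation coefficient and balanced accuracy; returning 0.0 for an undefined MCC is a common convention and an assumption here, and the counts are invented for illustration.

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity (BBB+ recall) and specificity (BBB- recall)."""
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return (sens + spec) / 2


# Invented counts for illustration only.
print(mcc(80, 10, 10, 20), balanced_accuracy(80, 10, 10, 20))
```

Both metrics are less sensitive to class imbalance than plain accuracy, which matters when BBB+ and BBB− compounds are unevenly represented.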
Ensemble
Soft Vote
Mean predicted probability across all 6 models
Confidence Tier
🟢🔵🟡🔴
≥0.75 High · ≥0.50 Moderate · ≥0.25 Uncertain · <0.25 BBB−
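The ensemble averaging and tier thresholds above can be sketched as follows. Function names and the per-model probabilities are hypothetical; the tier label is written with an ASCII hyphen in code.

```python
def ensemble_probability(model_probs):
    """Mean BBB+ probability across models (dict of model name -> probability)."""
    return sum(model_probs.values()) / len(model_probs)


def confidence_tier(p):
    """Map an ensemble probability to the four tiers listed above."""
    if p >= 0.75:
        return "High"
    if p >= 0.50:
        return "Moderate"
    if p >= 0.25:
        return "Uncertain"
    return "BBB-"


# Invented per-model probabilities for one compound.
probs = {
    "XGBoost": 0.90,
    "RandomForest": 0.80,
    "LightGBM": 0.85,
    "ExtraTrees": 0.70,
    "GradientBoosting": 0.75,
    "LogisticRegression": 0.60,
}
p = ensemble_probability(probs)
print(round(p, 3), confidence_tier(p))
```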
TreeExplainer
STAGE 5 · SHAP Explainability
🔍
SHAP TreeExplainer
  • Applied to RF, XGBoost, best model
  • 500-compound random subsample
  • Mean |SHAP| per feature = global importance
  • Direction: ↑BBB+ or ↓BBB+ per feature
  • Beeswarm plots: feature × SHAP value × feature magnitude
tree-model feature attribution
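Given a matrix of SHAP values (one row per compound, one column per feature), the global importance described above is the column-wise mean of absolute values. The sketch below assumes the SHAP values have already been computed; the function name and the numbers are hypothetical.

```python
def global_importance(shap_rows, feature_names):
    """Rank features by mean |SHAP| across compounds, largest first."""
    n = len(shap_rows)
    means = [
        sum(abs(row[j]) for row in shap_rows) / n
        for j in range(len(feature_names))
    ]
    return sorted(zip(feature_names, means), key=lambda kv: kv[1], reverse=True)


# Invented SHAP values for three compounds and three features.
shap_rows = [
    [-0.30, 0.10, -0.05],
    [-0.20, 0.15, 0.02],
    [0.25, -0.05, -0.03],
]
ranking = global_importance(shap_rows, ["TPSA", "LogP", "HBD"])
print(ranking)
```

Taking absolute values before averaging keeps positive and negative contributions from cancelling, so the ranking reflects magnitude only; direction is read separately.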
📌
Key SHAP Findings
  • TPSA — strongest negative predictor (↓ BBB+)
  • LogP — moderate positive predictor (↑ BBB+)
  • NHOHCount / HBD — H-bond donor penalty
  • BBB Score / CNS MPO — composite positive signals
  • MolMR / FractionCSP3 — structural favourability
Cross-checked against Gini importance
rule-based PK framework
STAGE 6 · Mechanistic PK Decomposition
PATHWAY 1 · P-gp Efflux Class
Low (NER~1.0) / Medium (NER~2.2) / High (NER~26.3)
Rule-based: MW, HBD, TPSA, LogP, amide count
PATHWAY 2 · fup (Plasma Unbound)
Lobell & Sivarajah 2003 approximation
Driven by LogP, ionization class, MW
PATHWAY 3 · fubrain (Brain Unbound)
Fridén et al. 2010 model
Key driver: LogP (brain lipid binding)
PATHWAY 4 · Kp,brain
Rodgers & Rowland 2006
Passive tissue partitioning estimate
Central PK Equation — J. Med. Chem. 2021 Tiered Framework
Kp,uu,brain = ( Kp,brain / NER ) × ( fubrain / fup )   // unbound brain-to-plasma partition coefficient
> 1.0 = brain accumulation · 0.3–1.0 = good CNS exposure · 0.1–0.3 = efflux-limited · < 0.1 = poor CNS exposure
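The interpretation bands for Kp,uu,brain can be expressed as a small lookup. The source does not say which band the exact boundary values belong to, so assigning 1.0 and 0.3 to "good CNS exposure" and 0.1 to "efflux-limited" is an assumption of this sketch, as is the function name.

```python
def classify_kpuu(kpuu):
    """Map a Kp,uu,brain value to the four interpretation bands."""
    if kpuu > 1.0:
        return "brain accumulation"
    if kpuu >= 0.3:
        return "good CNS exposure"
    if kpuu >= 0.1:
        return "efflux-limited"
    return "poor CNS exposure"


# Invented values for illustration only.
for value in (1.4, 0.6, 0.2, 0.05):
    print(value, classify_kpuu(value))
```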
ODE solver RK45
STAGE 7 · PBPK Simulation
🧮
2-Compartment ODE Model
  • Compartments: Plasma ↔ Brain
  • dCp/dt = −(CL_sys + CL_passive)/Vp × Cp + (CL_passive + CL_efflux)/Vp × Cb
  • dCb/dt = (CL_passive/Vb) × Cp − (CL_passive + CL_efflux)/Vb × Cb
  • Solver: SciPy solve_ivp, RK45, IV bolus 10 mg/kg
  • CL_passive derived from LogP + TPSA (passive diffusion)
  • CL_efflux = CL_passive × (NER−1) from P-gp class
t = 0 → 24h · 200 timepoints
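The two ODEs above can be integrated with a few lines of code. To stay dependency-free, the sketch below swaps in a fixed-step classical Runge-Kutta (RK4) scheme rather than the adaptive RK45 method of SciPy's `solve_ivp` that the pipeline names, so it is a stand-in, not the pipeline's solver. All parameter values (clearances, volumes, dose in arbitrary units) are hypothetical.

```python
def simulate(cl_sys, cl_passive, ner, vp, vb, dose, t_end=24.0, n_points=200):
    """Integrate the plasma/brain model with fixed-step RK4 after an IV bolus."""
    cl_efflux = cl_passive * (ner - 1.0)  # efflux clearance from the P-gp class NER

    def deriv(cp, cb):
        dcp = -(cl_sys + cl_passive) / vp * cp + (cl_passive + cl_efflux) / vp * cb
        dcb = (cl_passive / vb) * cp - (cl_passive + cl_efflux) / vb * cb
        return dcp, dcb

    h = t_end / (n_points - 1)  # step size so that the grid spans 0..t_end
    cp, cb = dose / vp, 0.0     # bolus enters plasma; brain starts empty
    times, cps, cbs = [0.0], [cp], [cb]
    for i in range(1, n_points):
        k1 = deriv(cp, cb)
        k2 = deriv(cp + 0.5 * h * k1[0], cb + 0.5 * h * k1[1])
        k3 = deriv(cp + 0.5 * h * k2[0], cb + 0.5 * h * k2[1])
        k4 = deriv(cp + h * k3[0], cb + h * k3[1])
        cp += h / 6.0 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        cb += h / 6.0 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
        times.append(i * h)
        cps.append(cp)
        cbs.append(cb)
    return times, cps, cbs


# Hypothetical parameters: Medium P-gp class (NER ~2.2), arbitrary units.
times, cps, cbs = simulate(cl_sys=0.5, cl_passive=1.0, ner=2.2, vp=3.0, vb=1.4, dose=10.0)
print(len(times), round(max(cbs), 4))
```

With systemic clearance set to zero, the brain-to-plasma ratio settles at CL_passive / (CL_passive + CL_efflux) = 1 / NER, which is a useful sanity check on the equations. A fixed step becomes inaccurate for very fast exchange rates (for example the High P-gp class), which is one reason an adaptive solver is the safer choice in practice.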
💊
DDI Simulation
  • Scenario 1: Normal (baseline)
  • Scenario 2: P-gp inhibited 90%
  • DDI Ratio: AUC_brain_inhibited / AUC_brain_normal
  • 🔴 >5× High risk
  • 🟡 2–5× Moderate
  • 🟢 <2× Low risk
drug–drug interaction
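The DDI ratio and its risk bands can be sketched as below, using trapezoidal integration for the brain AUC. The concentration curves are invented; in the pipeline the inhibited curve would presumably come from a simulation run with efflux clearance reduced to 10% of baseline, which is an assumption about how "inhibited 90%" is applied.

```python
def trapezoid_auc(times, conc):
    """Area under a concentration-time curve by the trapezoidal rule."""
    return sum(
        (t1 - t0) * (c0 + c1) / 2.0
        for t0, t1, c0, c1 in zip(times, times[1:], conc, conc[1:])
    )


def ddi_risk(ratio):
    """Map the brain AUC ratio to the risk bands listed above."""
    if ratio > 5.0:
        return "High"
    if ratio >= 2.0:
        return "Moderate"
    return "Low"


# Invented brain concentration curves for illustration only.
times = [0.0, 1.0, 2.0, 4.0]
brain_normal = [0.0, 1.0, 0.8, 0.2]
brain_inhibited = [0.0, 3.0, 2.4, 0.6]
ratio = trapezoid_auc(times, brain_inhibited) / trapezoid_auc(times, brain_normal)
print(round(ratio, 2), ddi_risk(ratio))
```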
decision rules applied
STAGE 8 · Integrated Decision
Rule 1 · TPSA > 90 Å² and LogP < 1 → ❌ Deprioritize CNS
Rule 2 · P-gp = High → ❌ Kill or redesign
Rule 3 · P-gp = Medium and BBB+ prob ≥ 0.5 → ✅ Advance (acceptable efflux)
Rule 4 · BBB+ prob ≥ 0.5 → ✅ Advance
Default · None of the above → ⚠️ Flag for investigation
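The decision rules can be read as a first-match cascade. That evaluation order is an assumption (the source lists the rules but does not state how conflicts are resolved), and the function name and example inputs are hypothetical.

```python
def decide(tpsa, logp, pgp_class, bbb_prob):
    """Apply Rules 1 to 4 in order and return the first matching outcome."""
    if tpsa > 90 and logp < 1:                      # Rule 1
        return "Deprioritize CNS"
    if pgp_class == "High":                         # Rule 2
        return "Kill or redesign"
    if pgp_class == "Medium" and bbb_prob >= 0.5:   # Rule 3
        return "Advance (acceptable efflux)"
    if bbb_prob >= 0.5:                             # Rule 4
        return "Advance"
    return "Flag for investigation"                 # Default


# Invented compound profiles for illustration only.
print(decide(tpsa=110.0, logp=0.4, pgp_class="Low", bbb_prob=0.8))
print(decide(tpsa=45.0, logp=2.5, pgp_class="Medium", bbb_prob=0.7))
print(decide(tpsa=60.0, logp=2.0, pgp_class="Low", bbb_prob=0.3))
```

Under first-match ordering, a polar, low-LogP compound is deprioritized even if its predicted BBB+ probability is high, which is why rule order matters.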
Excel + PNG outputs
STAGE 9 · Outputs
📊
Excel Report (9 sheets)
Predictions, model stats, descriptor stats, feature selection, SHAP, PK, PBPK, training data, thresholds reference
📈
13 Publication Plots
ROC/PR curves, confusion matrices, SHAP beeswarms, PBPK curves, decision dashboard, radar charts
🧬
BBB+ Probability + Class
Per-compound: best model + ensemble vote + confidence tier (🟢🔵🟡🔴)
💊
PK Profile
P-gp class, fup, fubrain, Kp,brain, Kp,uu,brain, CNS limitation classification
📉
PBPK Brain AUC
Normal + P-gp inhibited AUC, Cmax,brain, DDI ratio, DDI risk classification
Go / No-Go Decision
Tiered L1 decision rule applied per compound: Advance / Deprioritize / Redesign / Flag
Why This Pipeline Is Exceptional
Beyond Classification — A Full CNS Drug Profiling System
🚫
Most Tools Stop at BBB+ / BBB−
Standard predictors (pkCSM, SwissADME, admetSAR) give you a binary label and a probability score. That tells you whether a drug crosses into the brain, but not how much free drug actually reaches the target. This pipeline goes further, using mechanistic PK to address the clinically relevant question.
🧬
Kp,uu,brain — The Right Metric
A compound can be BBB+ (permeates the barrier) yet still fail CNS efficacy because P-gp pumps it out or it binds tightly to brain lipids. Kp,uu,brain captures the actual free drug exposure at the target site — the number that correlates with pharmacodynamic effect.
P-gp Efflux Is Explicitly Modelled
P-glycoprotein (P-gp) is a major active efflux transporter at the BBB and a frequent cause of inadequate brain exposure for CNS drug candidates. This pipeline classifies P-gp substrate risk (Low/Medium/High) using structural rules, assigns the corresponding net efflux ratio (NER), and propagates it through both the PK framework and the PBPK simulation.
💊
DDI Simulation Is Unique
To the authors' knowledge, no open-source BBB predictor simulates the drug–drug interaction scenario in which a co-administered P-gp inhibitor (e.g. verapamil) raises brain exposure. This pipeline models 90% P-gp inhibition and reports the brain AUC ratio as a direct DDI safety signal.
🔍
SHAP Makes It Interpretable
Black-box predictions are of limited use in medicinal chemistry. SHAP attribution tells the chemist which descriptor is penalising BBB+ (e.g. TPSA too high, too many H-bond donors) and by how much, enabling rational redesign guided by structure–activity relationships (SAR).
📋
Go / No-Go Decision Automation
The pipeline doesn't just output numbers: it applies literature-grounded decision rules (TPSA > 90 with low LogP = deprioritize; high P-gp = kill or redesign) to produce an actionable tier per compound. This is intended to mirror the decision logic used in pharma CNS triage workflows.
FIGURE 1

Schematic overview of the Structure-Brain Link (SBL) blood-brain barrier penetration prediction pipeline. Input SMILES are featurised using 64 RDKit descriptors across four classes (physicochemical, topological, electrostatic, binary). After a four-step feature selection protocol, six machine learning classifiers are trained under 10-fold stratified cross-validation on a merged training set combining the BBBP benchmark (Martins et al. 2012; MoleculeNet; DeepChem; ~2,050 compounds) and the B3DB curated database (Meng et al. 2021; ~7,800 compounds), deduplicated by canonical SMILES to give a corpus of more than 7,800 unique compounds. SHAP TreeExplainer provides feature-level attribution. Predicted BBB+ compounds are further profiled via rule-based P-gp efflux classification, plasma and brain unbound fractions, and Kp,uu,brain estimation (J. Med. Chem. 2021 framework). A two-compartment PBPK ODE model simulates brain concentration–time profiles under baseline and P-gp inhibition (DDI) scenarios. All results are exported to a nine-sheet colour-coded Excel report alongside 13 publication-quality plots.