Structure-Brain Link (SBL) — Pipeline Architecture

STAGE 0

Input

Query Compound

CN1CCC[C@H]1c2cccnc2 ← SMILES string (e.g. Nicotine)

Format: SMILES string (CSV upload)

Training: BBBP + B3DB merged · 7,800+ compounds (deduplicated)

Source 1: Martins et al. 2012 · MoleculeNet · DeepChem (BBBP)

Source 2: Meng et al. 2021 · B3DB (theochem/B3DB) · Sci. Data 8, 289

Label: BBB+ (penetrant) / BBB− (non-penetrant)

RDKit mol object

STAGE 1

Descriptor Calculation

⚗️

Class A · Physicochemical

MW, ExactMW — molecular size
LogP — lipophilicity (Crippen)
TPSA — polar surface area
HBD / HBA — H-bond donors/acceptors
QED — drug-likeness score
BBB Score — Gupta et al. 2019 (0–6)
CNS MPO — Pfizer Wager 2010 (0–6)
Derived: TPSA/MW, LogP/TPSA, MW/HeavyAtom

~14 features

🕸️

Class B · Topological

Chi indices (0–3n, 0–1v) — connectivity
Kappa 1,2,3 — molecular shape
BertzCT, BalabanJ — complexity
Ring counts — aromatic, aliphatic, saturated
Amide bonds, Spiro, Bridgehead
Stereocenters, NHOHCount, NOCount

~25 features

⚡

Class C · Electrostatic

MaxPartialCharge, MinPartialCharge
MaxAbsCharge, MinAbsCharge
NumValenceElectrons
LabuteASA — accessible surface area
PEOE_VSA 1–3 — charge-weighted VSA
SMR_VSA, SlogP_VSA — polarisability VSA

~13 features

🔵

Class D · Binary Flags

Lipinski violations (0–4) + pass flag
Ionization class — basic / acidic / zwitterionic / neutral (one-hot)
Functional groups — amide, amine, alcohol, carboxyl, sulfone, halogen, aromatic N

~12 features
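
The Lipinski flag in Class D can be sketched as below. This is a minimal illustration that assumes the textbook rule-of-five cut-offs and precomputed descriptor values; the function name `lipinski_flags` and the "at most one violation passes" convention are assumptions, and the pipeline's own thresholds may differ.

```python
def lipinski_flags(mw, logp, hbd, hba):
    """Return (violation_count, passes) for the rule of five.

    Cut-offs are the textbook rule-of-five values (an assumption here).
    Treating <= 1 violation as a pass is also an assumption.
    """
    violations = sum([
        mw > 500,    # molecular weight above 500 Da
        logp > 5,    # too lipophilic
        hbd > 5,     # too many H-bond donors
        hba > 10,    # too many H-bond acceptors
    ])
    return violations, violations <= 1


# Nicotine-like values (approximate, for illustration only)
count, ok = lipinski_flags(mw=162.2, logp=1.8, hbd=0, hba=2)
```

The count (0–4) and the pass flag map onto the two Lipinski features listed above.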

64 raw features → feature selection

STAGE 2

Feature Selection

Step 1

Variance Filter

Remove near-zero variance features
threshold = 0.01

Step 2

Correlation Filter

Remove highly co-linear pairs
|r| > 0.90 cut-off

Step 3

Mutual Information

Rank features by MI with BBB label
Top-40, MI > 0.005

Step 4

VIF Check

Variance Inflation Factor audit
multicollinearity flag

Result

~35–40 features

Imputed (median) + StandardScaled for LR
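
Steps 1 and 2 can be sketched with the standard library alone. This is a hedged illustration, not the pipeline's code: `filter_features` is a hypothetical helper, and keeping the first feature of each collinear pair is an assumption about tie-breaking.

```python
from statistics import mean, pvariance


def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def filter_features(columns, var_threshold=0.01, corr_cutoff=0.90):
    """columns: dict of feature name -> list of values (no missing data).

    Step 1 drops near-constant features; Step 2 walks the survivors in
    order and drops any feature with |r| above the cut-off against a
    feature already kept (keeping the first of a pair is an assumption).
    """
    kept = []
    for name, values in columns.items():
        if pvariance(values) <= var_threshold:
            continue  # Step 1: variance filter
        if any(abs(pearson(values, columns[k])) > corr_cutoff for k in kept):
            continue  # Step 2: correlation filter
        kept.append(name)
    return kept


# Toy descriptor table (made-up values)
toy = {
    "TPSA": [20.0, 45.0, 90.0, 120.0, 60.0],
    "TPSA_copy": [21.0, 46.0, 91.0, 121.0, 61.0],  # perfectly collinear
    "flag_const": [1.0, 1.0, 1.0, 1.0, 1.0],       # zero variance
    "LogP": [3.1, 0.9, 2.0, 2.8, 1.0],
}
selected = filter_features(toy)
```

The mutual-information ranking and VIF audit (Steps 3 and 4) would run on the features that survive this pass.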

selected feature matrix X

STAGE 3

EDA & Statistics

📊

Mann-Whitney U Test

Two-sided test per descriptor
BBB+ vs BBB− group comparison
Significance: *** p<0.001, ** p<0.01, * p<0.05

📈

Point-Biserial Correlation

r_pb between each descriptor and BBB label
Direction: promotes (+) vs inhibits (−) BBB+
TPSA is consistently top negative predictor
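
The point-biserial coefficient can be computed directly from its definition. A minimal sketch, assuming labels coded 1 for BBB+ and 0 for BBB−; the helper name and the toy values are made up for illustration.

```python
from statistics import mean, pstdev


def point_biserial(values, labels):
    """r_pb between a continuous descriptor and a 0/1 label (1 = BBB+)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    n, n1, n0 = len(values), len(pos), len(neg)
    # (difference of group means / overall SD) scaled by the class balance
    return (mean(pos) - mean(neg)) / pstdev(values) * (n1 * n0 / n**2) ** 0.5


# Toy data (made up): high TPSA goes with BBB- in this sample
tpsa = [25.0, 40.0, 35.0, 110.0, 95.0, 130.0]
bbb = [1, 1, 1, 0, 0, 0]
r = point_biserial(tpsa, bbb)
```

A negative `r` corresponds to the "inhibits (−) BBB+" direction described above.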

🗺️

Correlation Heatmap

Pearson r matrix for top-15 features
Identifies collinear descriptor clusters

10-fold stratified CV

STAGE 4

ML Model Training

6 Classifiers · 10-Fold Stratified Cross-Validation · Class-Balanced

XGBoost

AUC ~0.93

scale_pos_weight · n=300

Random Forest

AUC ~0.92

balanced · sqrt features · n=300

LightGBM

AUC ~0.92

num_leaves=63 · n=300

Extra Trees

AUC ~0.91

balanced · n=300

Gradient Boost

AUC ~0.90

subsample=0.8 · n=200

Logistic Reg.

AUC ~0.87

balanced · StandardScaler

Primary metric

AUC-ROC

Out-of-fold (OOF) predictions
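
AUC-ROC on out-of-fold predictions can be read as the probability that a randomly chosen BBB+ compound scores higher than a randomly chosen BBB− compound. The sketch below computes it by pairwise comparison; it is an illustrative stand-in (quadratic in sample size) rather than the pipeline's metric code, and the labels and scores are made up.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. labels are 0/1 with 1 = BBB+.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up out-of-fold predictions for six compounds
oof_labels = [1, 1, 0, 1, 0, 0]
oof_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
auc = roc_auc(oof_labels, oof_scores)
```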

Also tracked

MCC, F1, AUC-PR

Balanced accuracy, precision, recall

Ensemble

Majority Vote

Mean probability across all 6 models

Confidence Tier

🟢🔵🟡🔴

≥0.75 High · ≥0.50 Moderate · ≥0.25 Uncertain · <0.25 BBB−
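
The ensemble score and tier assignment can be sketched as follows. This follows the "mean probability across all 6 models" description and the four thresholds listed above; the function names and the example probabilities are hypothetical.

```python
from statistics import mean


def confidence_tier(p_bbb):
    """Map a BBB+ probability to the four tiers listed above."""
    if p_bbb >= 0.75:
        return "High"
    if p_bbb >= 0.50:
        return "Moderate"
    if p_bbb >= 0.25:
        return "Uncertain"
    return "BBB-"


def ensemble_call(model_probs):
    """model_probs: dict of model name -> predicted BBB+ probability."""
    p = mean(model_probs.values())
    return p, confidence_tier(p)


# Made-up per-model probabilities for one compound
probs = {
    "XGBoost": 0.91, "Random Forest": 0.88, "LightGBM": 0.90,
    "Extra Trees": 0.84, "Gradient Boost": 0.86, "Logistic Reg.": 0.71,
}
p_mean, tier = ensemble_call(probs)
```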

TreeExplainer

STAGE 5

SHAP Explainability

🔍

SHAP TreeExplainer

Applied to RF, XGBoost, best model
500-compound random subsample
Mean |SHAP| per feature = global importance
Direction: ↑BBB+ or ↓BBB+ per feature
Beeswarm plots: feature × SHAP value × feature magnitude
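
Global importance as mean |SHAP| per feature reduces to a column-wise average of absolute values. A minimal sketch, assuming the SHAP values have already been computed and are available as one row per compound; the helper name and the numbers are made up.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value.

    shap_rows: list of per-compound lists, one SHAP value per feature.
    Returns [(feature, mean |SHAP|), ...] sorted from most to least important.
    """
    n = len(shap_rows)
    scores = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up SHAP values for three compounds and three descriptors
features = ["TPSA", "LogP", "HBD"]
shap_rows = [
    [-0.30, 0.10, -0.05],
    [0.20, 0.05, 0.00],
    [-0.40, -0.15, -0.10],
]
ranking = mean_abs_shap(shap_rows, features)
```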

tree-model attribution

📌

Key SHAP Findings

TPSA — strongest negative predictor (↓ BBB+)
LogP — moderate positive predictor (↑ BBB+)
NHOHCount / HBD — H-bond donor penalty
BBB Score / CNS MPO — composite positive signals
MolMR / FractionCSP3 — structural favourability

Gini importance cross-validated

rule-based PK framework

STAGE 6

Mechanistic PK Decomposition

PATHWAY 1

P-gp Efflux Class

Low (NER~1.0) / Medium (NER~2.2) / High (NER~26.3)
Rule-based: MW, HBD, TPSA, LogP, amide count

PATHWAY 2

fup (Plasma Unbound)

Lobell & Sivarajah 2003 approximation
Driven by LogP, ionization class, MW

PATHWAY 3

fubrain (Brain Unbound)

Fridén et al. 2010 model
Key driver: LogP (brain lipid binding)

PATHWAY 4

Kp,brain

Rodgers & Rowland 2006
Passive tissue partitioning estimate

Central PK Equation — J. Med. Chem. 2021 Tiered Framework

Kp,uu,brain = ( Kp,brain / NER ) × ( fup / fubrain ) // unbound brain-to-plasma partition

> 1.0 = brain accumulation · 0.3–1.0 = good CNS exposure · 0.1–0.3 = efflux-limited · < 0.1 = poor CNS exposure
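
The interpretation bands can be applied to an already computed Kp,uu,brain value as sketched below. How values falling exactly on a boundary (0.1, 0.3, 1.0) are assigned is an assumption, since the bands above share their endpoints; `kpuu_tier` is a hypothetical name.

```python
def kpuu_tier(kpuu_brain):
    """Classify an unbound brain-to-plasma partition coefficient.

    Boundary values are assigned to the higher band except 1.0, which is
    treated as 'good CNS exposure' (an assumption).
    """
    if kpuu_brain > 1.0:
        return "brain accumulation"
    if kpuu_brain >= 0.3:
        return "good CNS exposure"
    if kpuu_brain >= 0.1:
        return "efflux-limited"
    return "poor CNS exposure"


tiers = [kpuu_tier(v) for v in (1.4, 0.45, 0.12, 0.04)]
```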

ODE solver RK45

STAGE 7

PBPK Simulation

🧮

2-Compartment ODE Model

Compartments: Plasma ↔ Brain
dCp/dt = −(CL_sys + CL_passive)/Vp × Cp + (CL_passive + CL_efflux)/Vp × Cb
dCb/dt = (CL_passive/Vb) × Cp − (CL_passive + CL_efflux)/Vb × Cb
Solver: SciPy solve_ivp, RK45, IV bolus 10 mg/kg
CL_passive derived from LogP + TPSA (passive diffusion)
CL_efflux = CL_passive × (NER−1) from P-gp class

t = 0 → 24h · 200 timepoints
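
The two ODEs above can be integrated as sketched below. To stay on the standard library, this sketch swaps SciPy's adaptive RK45 for a fixed-step classical RK4 with sub-stepping; the clearances, volumes, and initial plasma concentration are illustrative placeholders, not the pipeline's parameters.

```python
def simulate_brain_pk(cl_sys, cl_passive, ner, vp=1.0, vb=0.1,
                      c0_plasma=10.0, t_end=24.0, n_points=200, substeps=20):
    """Integrate the plasma/brain model with fixed-step RK4.

    Returns (times, plasma_conc, brain_conc). Default values are
    placeholders chosen only to make the example run.
    """
    cl_efflux = cl_passive * (ner - 1.0)  # CL_efflux = CL_passive x (NER - 1)

    def rhs(cp, cb):
        dcp = -(cl_sys + cl_passive) / vp * cp + (cl_passive + cl_efflux) / vp * cb
        dcb = cl_passive / vb * cp - (cl_passive + cl_efflux) / vb * cb
        return dcp, dcb

    h = t_end / (n_points - 1) / substeps  # internal step size in hours
    cp, cb = c0_plasma, 0.0                # IV bolus: all drug starts in plasma
    times, plasma, brain = [0.0], [cp], [cb]
    for i in range(1, n_points):
        for _ in range(substeps):
            k1 = rhs(cp, cb)
            k2 = rhs(cp + 0.5 * h * k1[0], cb + 0.5 * h * k1[1])
            k3 = rhs(cp + 0.5 * h * k2[0], cb + 0.5 * h * k2[1])
            k4 = rhs(cp + h * k3[0], cb + h * k3[1])
            cp += h / 6.0 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
            cb += h / 6.0 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
        times.append(i * t_end / (n_points - 1))
        plasma.append(cp)
        brain.append(cb)
    return times, plasma, brain


def trapezoid_auc(times, conc):
    """Area under a concentration-time curve by the trapezoid rule."""
    return sum((t1 - t0) * (c0 + c1) / 2.0
               for t0, t1, c0, c1 in zip(times, times[1:], conc, conc[1:]))


t, cp, cb = simulate_brain_pk(cl_sys=0.5, cl_passive=0.05, ner=2.2)
auc_brain = trapezoid_auc(t, cb)
```

Raising `ner` increases `cl_efflux` and lowers the brain AUC, which is the behaviour the P-gp class is meant to capture.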

💊

DDI Simulation

Scenario 1: Normal (baseline)
Scenario 2: P-gp inhibited 90%
DDI Ratio: AUC_brain_inhibited / AUC_brain_normal
🔴 >5× High risk
🟡 2–5× Moderate
🟢 <2× Low risk
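
The DDI read-out can be sketched as two small helpers. Modelling 90% inhibition as a tenfold reduction in CL_efflux is an assumption about how the scenario is implemented, and both function names are hypothetical; the AUC values in the example are made up.

```python
def inhibited_efflux(cl_passive, ner, inhibition=0.90):
    """CL_efflux after scaling the P-gp component down by `inhibition`."""
    return cl_passive * (ner - 1.0) * (1.0 - inhibition)


def ddi_risk(auc_brain_inhibited, auc_brain_normal):
    """Return (DDI ratio, risk class) using the thresholds listed above."""
    ratio = auc_brain_inhibited / auc_brain_normal
    if ratio > 5.0:
        return ratio, "High risk"
    if ratio >= 2.0:
        return ratio, "Moderate"
    return ratio, "Low risk"


# Made-up brain AUCs for the baseline and P-gp-inhibited scenarios
ratio, risk = ddi_risk(auc_brain_inhibited=30.0, auc_brain_normal=10.0)
```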

drug–drug interaction

decision rules applied

STAGE 8

Integrated Decision

Rule 1

TPSA > 90 Å² and LogP < 1

❌ Deprioritize CNS

Rule 2

P-gp = High

❌ Kill or redesign

Rule 3

P-gp = Medium and BBB+ prob ≥ 0.5

✅ Advance (acceptable efflux)

Rule 4

BBB+ prob ≥ 0.5

✅ Advance

Default

None of above

⚠️ Flag for investigation
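
The rule cascade can be written as a single function. This sketch assumes the rules are evaluated in the order shown and that the first match wins; `decide` and its argument names are hypothetical.

```python
def decide(tpsa, logp, pgp_class, p_bbb):
    """Apply Rules 1-4 in order and fall back to the default."""
    if tpsa > 90 and logp < 1:                    # Rule 1
        return "Deprioritize CNS"
    if pgp_class == "High":                       # Rule 2
        return "Kill or redesign"
    if pgp_class == "Medium" and p_bbb >= 0.5:    # Rule 3
        return "Advance (acceptable efflux)"
    if p_bbb >= 0.5:                              # Rule 4
        return "Advance"
    return "Flag for investigation"               # Default


decision = decide(tpsa=60.0, logp=2.5, pgp_class="Medium", p_bbb=0.6)
```

Because Rule 2 precedes Rule 4, a high-efflux compound is stopped even when its BBB+ probability is high.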

Excel + PNG outputs

STAGE 9

Outputs

📊

Excel Report (9 sheets)

Predictions, model stats, descriptor stats, feature selection, SHAP, PK, PBPK, training data, thresholds reference

📈

13 Publication Plots

ROC/PR curves, confusion matrices, SHAP beeswarms, PBPK curves, decision dashboard, radar charts

🧬

BBB+ Probability + Class

Per-compound: best model + ensemble vote + confidence tier (🟢🔵🟡🔴)

💊

PK Profile

P-gp class, fup, fubrain, Kp,brain, Kp,uu,brain, CNS limitation classification

📉

PBPK Brain AUC

Normal + P-gp inhibited AUC, Cmax,brain, DDI ratio, DDI risk classification

✅

Go / No-Go Decision

Tiered L1 decision rule applied per compound: Advance / Deprioritize / Redesign / Flag

Why This Pipeline Is Exceptional

Beyond Classification — A Full CNS Drug Profiling System

🚫

Most Tools Stop at BBB+ / BBB−

Standard predictors (pkCSM, SwissADME, admetSAR) typically report a BBB label or score. That indicates whether a drug is likely to enter the brain, but not how much free drug reaches the target. This pipeline adds mechanistic PK to estimate unbound brain exposure.

🧬

Kp,uu,brain — The Right Metric

A compound can be BBB+ (permeates the barrier) yet still fail CNS efficacy because P-gp pumps it out or it binds tightly to brain lipids. Kp,uu,brain captures the actual free drug exposure at the target site — the number that correlates with pharmacodynamic effect.

⚡

P-gp Efflux Is Explicitly Modelled

P-glycoprotein (P-gp) is a major active efflux transporter at the BBB and a frequent cause of poor brain exposure for CNS candidates. This pipeline classifies P-gp substrate risk (Low/Medium/High) using structural rules, assigns a net efflux ratio (NER) to each class, and propagates it through both the PK framework and the PBPK simulation.

💊

DDI Simulation Is Unique

To our knowledge, open-source BBB predictors do not simulate the drug–drug interaction scenario in which a co-administered P-gp inhibitor (e.g. verapamil) raises brain exposure. This pipeline models 90% P-gp inhibition and reports the brain AUC ratio as a DDI safety signal.

🔍

SHAP Makes It Interpretable

Black-box predictions offer little guidance in medicinal chemistry. SHAP attribution shows the chemist which descriptors are penalising BBB+ (e.g. TPSA too high, too many H-bond donors) and by roughly how much, supporting rational, structure–activity relationship (SAR) guided redesign.

📋

Go / No-Go Decision Automation

The pipeline does not just output numbers: it applies literature-grounded decision rules (TPSA > 90 and LogP < 1: deprioritize; high P-gp: kill or redesign) to produce an actionable tier per compound, in the style of the triage logic used in CNS drug discovery.
