This repository contains a complete multi-task toxicity screening pipeline using OLMo-7B and Mistral-7B fine-tuned on the Tox21 benchmark. Developed for DeepChem GSoC 2026, this project uncovers a critical flaw in published baselines — the random vs scaffold split gap — and demonstrates how LLMs can achieve competitive generalization when evaluated honestly.
OLMo-7B QLoRA — Scaffold Split — Mean ROC-AUC: 0.7225
| Task | ROC-AUC |
|---|---|
| NR-AR | 0.7179 |
| NR-AR-LBD | 0.8454 |
| NR-AhR | 0.7312 |
| NR-Aromatase | 0.7062 |
| NR-ER | 0.6888 |
| NR-ER-LBD | 0.8326 |
| NR-PPAR-gamma | 0.7343 |
| SR-ARE | 0.6968 |
| SR-ATAD5 | 0.6411 |
| SR-HSE | 0.6352 |
| SR-MMP | 0.7394 |
| SR-p53 | 0.7010 |
| MEAN | 0.7225 |
| Model | Split | ROC-AUC |
|---|---|---|
| RF + ECFP | Random (published baseline) | 0.8183 |
| RF + ECFP | Scaffold (honest evaluation) | 0.6135 |
| Gap | 0.2048 |
This gap of 0.20 proves that most published baselines use random split — which leaks scaffold information and overestimates real-world generalization. LLM scaffold-split numbers are NOT underperforming — they are the honest numbers.
This finding directly justifies why the DeepChem GSoC project must use ScaffoldSplitter as the evaluation standard.
| Challenge | Solution | Impact |
|---|---|---|
| NaN Loss Crashes | fp32 loss upcasting + gradient clipping | Eliminated fp16 gradient underflow on imbalanced Tox21 |
| Class Imbalance | 8x oversampling of toxic minority class | Prevented bias toward non-toxic majority labels |
| Memory Bottleneck | 4-bit NF4 + gradient checkpointing | OLMo-7B fits in 16GB VRAM |
| Invalid SMILES | RDKit sanitization — dropped 8 metal-ion SMILES | Prevented tokenizer instability from [Hg+2], [Fe+2] |
| Baseline Inflation | Scaffold split enforced throughout | True out-of-distribution generalization |
| File | Description |
|---|---|
tox21_mistral_benchmark.py |
Mistral-7B training pipeline, scaffold split, 8x oversampling |
dataset.py |
Tox21 data loading, RDKit sanitization, scaffold split logic |
model.py |
OLMo-7B 4-bit NF4 wrapper with LoRA config |
train.py |
Training loop with checkpoint saving |
OLMo_Tox21_MultiTask_Final.ipynb |
Final OLMo-7B run — Mean ROC-AUC 0.7225 |
graphconv-tox21-deepchem.ipynb |
RF baseline + scaffold vs random gap discovery |
requirements.txt |
Dependencies |
pip install -r requirements.txt
# Modular pipeline (OLMo-7B)
python dataset.py # prepare data
python train.py # train model
# Mistral-7B benchmark
python tox21_mistral_benchmark.py
# Full notebook — Kaggle
# kaggle.com/sameernadeem66/graphconv-tox21-deepchem| Task | Model | Result | Repo |
|---|---|---|---|
| BACE Classification | Mistral-7B QLoRA | 0.8371 ROC-AUC | BACE Repo |
| BBBP Classification | Mistral-7B QLoRA | 0.7141 ROC-AUC | BBBP Repo |
| ClinTox Classification | Mistral-7B QLoRA | 0.9913 ROC-AUC | ClinTox Repo |
| Tox21 Multi-Task | OLMo-7B QLoRA | 0.7225 Mean ROC-AUC | This Repo |
| ESOL Regression | OLMo-7B + Reg Head | 0.8582 RMSE | [ESOL Repo] |
| SMILES Generation | OLMo-7B + RDKit TSM | 20/20 = 100% valid | Generation Repo |
---
## Proposal mein Tox21 row update karo
**Page 4 table — Tox21 row:**
Old:
Tox21 | Multi-Task | Native GraphConv | CPU/T4 | 0.6859 & 0.72 BB
New:
Tox21 | Multi-Task | OLMo-7B (4-bit QLoRA) | Kaggle T4 | 0.7225 | Key finding: RF random split = 0.8183 vs scaffold = 0.6135 — 0.20 gap proves published baselines overestimate generalization. RDKit dropped 8 invalid metal-ion SMILES [Hg+2], [Fe+2].