"Teaching Language Models to 'Read' Chemistry."
ChemLLM-Adapter is a fine-tuning engine designed to bridge the gap between generative AI and computational toxicology. By leveraging QLoRA (Quantized Low-Rank Adaptation), it adapts the 7B-parameter OLMo model to predict molecular toxicity directly from chemical sequences (SMILES), bringing drug-discovery research within reach of consumer hardware.
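As a rough sketch, the QLoRA setup looks like the following. The quantization type (NF4) and adapter hyperparameters (r=32, alpha=64) come from the feature table below; the `target_modules` list, dropout value, and compute dtype are assumptions, not confirmed project settings.

```python
# Sketch of the QLoRA setup: 4-bit NF4 quantization plus trainable LoRA
# adapters, using the standard transformers + peft APIs.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NormalFloat quantization compresses the frozen base weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "allenai/OLMo-7B",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Small trainable LoRA adapters are attached on top of the quantized model.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,                      # assumption
    target_modules=["q_proj", "v_proj"],    # assumption: attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```

With this configuration only the adapter weights receive gradients, which is what makes training a 7B model tractable on a single consumer GPU.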
| Feature | The Tech Behind It |
|---|---|
| 🧠 QLoRA Fine-Tuning | Implements 4-bit NormalFloat (NF4) quantization to compress the 7B model, while attaching trainable LoRA adapters (r=32, alpha=64) for efficient learning. |
| 🧪 Molecular Tokenization | Converts raw SMILES strings (e.g., `C(=O)O`) into structured semantic prompts: `Molecule: [SMILES]\nTask: [Assay]\nIs Toxic: ?` |
| ⚖️ Smart Balancing | The `dataset.py` engine automatically handles the severe class imbalance in Tox21 by applying 3x oversampling to the rare toxic samples. |
| 🛡️ DeepChem Integration | Uses RDKit and DeepChem loaders to validate molecular integrity before ingestion into the neural network. |
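The tokenization and balancing steps above can be sketched in plain Python. The prompt template matches the table; the function names and record layout (`(smiles, assay, label)` tuples) are illustrative assumptions, not the actual `dataset.py` interface.

```python
# Minimal sketch of the dataset-side logic: format a SMILES record into the
# training prompt, then oversample toxic examples 3x to counter imbalance.

def build_prompt(smiles: str, assay: str) -> str:
    """Turn a raw SMILES string and assay name into the prompt template."""
    return f"Molecule: {smiles}\nTask: {assay}\nIs Toxic:"

def oversample_toxic(records, factor=3):
    """Repeat toxic records (label == 1) `factor` times; keep others as-is."""
    balanced = []
    for smiles, assay, label in records:
        copies = factor if label == 1 else 1
        balanced.extend([(smiles, assay, label)] * copies)
    return balanced

# Toy example: one non-toxic and one toxic molecule for a Tox21 assay.
data = [("C(=O)O", "NR-AhR", 0), ("c1ccccc1N", "NR-AhR", 1)]
balanced = oversample_toxic(data)
# The single toxic sample now appears 3 times alongside 1 non-toxic sample.
```

Oversampling at the record level (rather than reweighting the loss) keeps the training loop unchanged, at the cost of a larger epoch.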
Visualizing the Data Flow from Molecule to Prediction:
```mermaid
graph TD
    A[🧪 Tox21 Raw Data] -->|Validation| B(dataset.py)
    B -->|Tokenize| C{Molecules}
    D[🤖 OLMo-7B Model] -->|Quantize| E(model.py)
    C & E -->|Train Loop| F[train.py]
    F --> G[🚀 Fine-Tuned Adapter]
```