A comprehensive deep learning model for molecular toxicity prediction using a multimodal approach. The model combines molecular graphs, SMILES sequences, 2D structure images, and NMR spectra to predict toxicity across 12 different endpoints from the Tox21 dataset.
- Overview
- Features
- Project Structure
- Installation
- Data Download
- Data Preprocessing
- Running Experiments
- Results
This model leverages pre-trained molecular encoders and multimodal fusion for toxicity prediction, achieving superior performance compared to single-modal baselines on the Tox21 benchmark.
It supports:
- 4 Single-Modal Encoders: Graph (GNN), SMILES (Transformer), Image (CNN), Spectrum (1D CNN)
- 8 Multimodal Fusion Models: All six pairwise combinations, one three-modality model (Graph + SMILES + Image), and the full four-modality model (MoltiTox)
- 12 Toxicity Endpoints: NR-AR, NR-AR-LBD, NR-AhR, NR-Aromatase, NR-ER, NR-ER-LBD, NR-PPAR-gamma, SR-ARE, SR-ATAD5, SR-HSE, SR-MMP, SR-p53
The multimodal models use self-attention fusion mechanisms to effectively combine complementary information from different molecular representations.
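To make the fusion step concrete, here is a minimal, framework-free sketch of self-attention over per-modality embeddings. It is an illustration under stated assumptions, not the code in models/models.py: it assumes each encoder has already produced an embedding of the same dimension, it omits the learned query/key/value projections that a real attention layer would have, and it mean-pools the attended tokens. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_fuse(embeddings):
    """Fuse modality embeddings with scaled dot-product self-attention.

    embeddings: one vector per modality (e.g. graph, SMILES, image,
    spectrum), all of the same length. Queries, keys, and values are the
    embeddings themselves (no learned projections in this sketch).
    Returns the fused vector and the attention matrix.
    """
    dim = len(embeddings[0])
    scale = math.sqrt(dim)
    attention = []
    attended = []
    for query in embeddings:
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in embeddings]
        weights = softmax(scores)
        attention.append(weights)
        attended.append([sum(w * value[i] for w, value in zip(weights, embeddings))
                         for i in range(dim)])
    # Mean-pool the attended modality tokens into one fused representation.
    fused = [sum(token[i] for token in attended) / len(attended)
             for i in range(dim)]
    return fused, attention


# Toy 4-dimensional embeddings standing in for two encoder outputs.
graph_emb = [1.0, 0.0, 0.0, 0.0]
smiles_emb = [0.0, 1.0, 0.0, 0.0]
fused, attention = self_attention_fuse([graph_emb, smiles_emb])
```

Each row of the attention matrix shows how strongly one modality attends to the others, which is the kind of quantity an attention analysis utility would inspect.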
- Automated Hyperparameter Tuning: Optuna-based TPE optimization
- Two-Stage Training: Hyperparameters are tuned on the train/valid split, then the model is retrained on train + valid with the best settings
- Pre-trained Backbones:
- MoLFormer-XL for SMILES encoding
- ImageMol (ResNet18) for image encoding
- CReSS for NMR spectrum encoding
- Multi-Task Learning: Simultaneous prediction of 12 toxicity endpoints
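The two-stage training protocol listed above can be sketched as follows. This is a dependency-free illustration, not the project's training code: plain random search stands in for Optuna's TPE sampler, the "model" is a toy one-dimensional threshold classifier, and the hyperparameter name `shift` is hypothetical. The part that carries over is the structure: select hyperparameters on the train/valid split, then refit on train + valid with the selected values.

```python
import json
import random


def fit(rows, shift):
    """Toy 'training': a threshold between the class means, offset by `shift`."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2 + shift


def accuracy(threshold, rows):
    """Fraction of rows classified correctly by the rule `x > threshold`."""
    return sum((x > threshold) == (y == 1) for x, y in rows) / len(rows)


def two_stage_train(train, valid, n_trials=20, seed=0):
    rng = random.Random(seed)
    best_params, best_score = None, -1.0
    # Stage 1: tune on the train/valid split (random search stands in for TPE).
    for _ in range(n_trials):
        params = {"shift": rng.uniform(-1.0, 1.0)}
        score = accuracy(fit(train, **params), valid)
        if score > best_score:
            best_params, best_score = params, score
    # Stage 2: retrain on train + valid with the selected hyperparameters.
    final_model = fit(train + valid, **best_params)
    return final_model, best_params, best_score


train = [(0.1, 0), (0.4, 0), (1.6, 1), (1.9, 1)]
valid = [(0.3, 0), (1.2, 1)]
model, best_params, valid_score = two_stage_train(train, valid)
# Serialized form of the selected hyperparameters, e.g. for a JSON file.
best_params_json = json.dumps(best_params)
```

Because the validation set is absorbed into training in stage 2, the validation score from stage 1 is only a selection signal; final performance should be measured on the held-out test set.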
[root]/
├── models/
│ └── models.py # All encoder and classifier models
├── utils/
│ ├── dataset.py # Dataset classes for all modalities
│ └── attention_analysis.py # Attention analysis utilities
├── experiments/
│ ├── graph/ # Graph Encoder
│ ├── smiles/ # SMILES Encoder
│ ├── image/ # Image Encoder
│ ├── spectrum/ # Spectrum Encoder
│ └── multimodal/ # Multimodal fusion experiments
│ ├── gph_smi/ # Graph + SMILES
│ ├── gph_img/ # Graph + Image
│ ├── gph_spec/ # Graph + Spectrum
│ ├── smi_img/ # SMILES + Image
│ ├── smi_spec/ # SMILES + Spectrum
│ ├── spec_img/ # Image + Spectrum
│ ├── gph_smi_img/ # Graph + SMILES + Image
│ └── moltitox/ # Graph + SMILES + Image + Spectrum
├── data/
│ ├── train.csv # Training data
│ ├── valid.csv # Validation data
│ ├── test.csv # Test data
│ ├── train_spectra.csv # Training subset: molecules with NMR spectra
│ ├── valid_spectra.csv # Validation subset: molecules with NMR spectra
│ ├── test_spectra.csv # Test subset: molecules with NMR spectra
│ ├── images/ # Molecular 2D images
│ └── spectra/ # NMR spectral data
├── checkpoints/
│ ├── encoder/ # Saved encoder weights
│ │ ├── train_only/ # Encoders trained on train set only
│ │ └── train_and_valid/ # Encoders trained on train+valid
│ ├── model/ # Saved full model weights
│ ├── parameters/ # Best hyperparameters (JSON)
│ └── pretrained_models/ # Pre-trained model checkpoints
├── main.py # Run all experiments sequentially
├── requirements.txt # Python dependencies
└── README.md
- Python 3.8+
- PyTorch 1.12+
- PyTorch Geometric
- Transformers (Hugging Face)
- RDKit
- scikit-learn
- optuna
- pandas
- numpy
- Pillow (imported as PIL)
- matplotlib
- seaborn
# Clone the repository
git clone https://github.com/skku-aihclab/proj25-molecule-toxicity-prediction.git
cd proj25-molecule-toxicity-prediction
# Install dependencies
pip install -r requirements.txt
See the Data Download section below for downloading datasets and pre-trained models.
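Once the dependencies are installed, a quick check like the one below can confirm that every listed package is importable before a long training run starts. This is a convenience sketch, not a script shipped with the repository. The module names are the standard import names for the listed packages, which can differ from the package names (scikit-learn imports as `sklearn`, Pillow as `PIL`, PyTorch Geometric as `torch_geometric`).

```python
from importlib.util import find_spec

# Import names for the packages listed in the requirements above.
REQUIRED_MODULES = [
    "torch", "torch_geometric", "transformers", "rdkit", "sklearn",
    "optuna", "pandas", "numpy", "PIL", "matplotlib", "seaborn",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing:", ", ".join(missing) if missing else "none")
```

The check only locates the modules without importing them, so it stays fast and does not initialize PyTorch or CUDA.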
All required data files, pre-trained models, and checkpoints are available on Google Drive:
Google Drive Link: https://drive.google.com/drive/folders/13QLMfp9T_C8tiHabWEwB1knieR9ZOZO9?usp=drive_link
MoltiTox/
├── data/
│ ├── 1st/
│ │ ├── train.csv
│ │ ├── valid.csv
│ │ ├── test.csv
│ │ ├── train_spectra.csv
│ │ ├── valid_spectra.csv
│ │ └── test_spectra.csv
│ ├── 2nd/
│ │ └── [same structure]
│ ├── 3rd/
│ │ └── [same structure]
│ ├── 4th/
│ │ └── [same structure]
│ └── 5th/
│ └── [same structure]
├── checkpoints/
│ ├── 1st/
│ │ ├── encoder/
│ │ ├── model/
│ │ └── parameters/
│ ├── 2nd/
│ │ └── [same structure]
│ ├── 3rd/
│ │ └── [same structure]
│ ├── 4th/
│ │ └── [same structure]
│ └── 5th/
│ └── [same structure]
├── ImageMol.pth
└── 8.pth
- Download Dataset
  Images and spectra are included in this repository. However, if you want to use the same data splits as reported in the paper, download the CSV files from Google Drive:
  - Navigate to the data/ folder on Google Drive
  - Choose one of the 5 splits (1st through 5th)
  - Download all 6 CSV files:
    - train.csv, valid.csv, test.csv (full datasets)
    - train_spectra.csv, valid_spectra.csv, test_spectra.csv (spectra subset only)
  - Place the downloaded CSV files in your local data/ directory
- Download Pre-trained Checkpoints (Optional)
  - Navigate to the checkpoints/ folder on Google Drive
  - Choose the corresponding split (1st through 5th) that matches your data
  - Download the entire folder structure (encoder/, model/, parameters/)
  - Place the downloaded folders in your local checkpoints/ directory
  - The checkpoint structure must match the original:
    checkpoints/
    ├── encoder/
    │ ├── train_only/
    │ └── train_and_valid/
    ├── model/
    └── parameters/
- Download Pre-trained Backbones
  - Download ImageMol.pth from the Google Drive root
  - Place it in experiments/image/ImageMol.pth
  - Download 8.pth from the Google Drive root
  - Place it in experiments/spectrum/8.pth
- Standard CSV files (train.csv, valid.csv, test.csv):
  - Used for single-modal (graph, SMILES, image) and multimodal experiments without spectra
- Spectra CSV files (train_spectra.csv, valid_spectra.csv, test_spectra.csv):
  - Contains only molecules with available NMR spectra
  - Subset of the standard datasets
The preprocessing pipeline for the Tox21 dataset is documented in data/preprocess.ipynb.
Note: The preprocessing notebook requires access to the original spectral databases (NMRShiftDB2, NP-MRD, HMDB). These databases are not included in this repository due to licensing and size constraints.
If you wish to reproduce the preprocessing pipeline or need access to the raw spectral databases, please contact me.
The pre-processed data (CSV files, images, and binary spectra) are already available and can be used directly without running the preprocessing pipeline.
To run all experiments sequentially (single-modal + multimodal):
python -u main.py 2>&1 | tee result.txt
To run a specific experiment:
# Single-modal examples
cd experiments/graph
python train.py # Train graph model
python test.py # Test graph model
cd ../smiles
python train.py # Train SMILES model
python test.py # Test SMILES model
# Multimodal examples
cd ../multimodal/gph_smi
python train.py # Train Graph+SMILES fusion model
python test.py # Test Graph+SMILES fusion model
cd ../smi_img
python train.py # Train SMILES+Image fusion model
python test.py # Test SMILES+Image fusion model
If you use this model in your research, please cite:
@article{park2025moltitox,
title={MoltiTox: a multimodal fusion model for molecular toxicity prediction},
author={Park, Junwoo and Lee, Sujee},
journal={Frontiers in Toxicology},
volume={7},
pages={1720651},
year={2025}
}
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or issues, please contact:
- Email: jw0528@g.skku.edu
Last Updated: December 2025
