This repository contains the implementation for pre-training an ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) model from scratch using the BabyLM dataset.
This project was part of my NLP class; the goal was to explore language acquisition by training a transformer model on a limited dataset (10M tokens), mirroring the constraints of the BabyLM Challenge.
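ELECTRA's pre-training objective, replaced-token detection (RTD), can be illustrated with a toy sketch (this is illustrative only, not this repository's code): a small generator proposes replacements at some positions, and the discriminator must label every token as original or replaced. Here a random sampler stands in for the generator.

```python
import random

def make_rtd_example(tokens, mask_prob=0.3, vocab=None, seed=0):
    """Toy replaced-token-detection (RTD) labeling, as in ELECTRA.

    A real ELECTRA generator is a small masked LM; here we stand in
    for it with random sampling from a tiny vocabulary (hypothetical,
    for illustration only).
    """
    rng = random.Random(seed)
    vocab = vocab or ["the", "cat", "sat", "on", "a", "mat", "dog", "ran"]
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            replacement = rng.choice(vocab)
            corrupted.append(replacement)
            # label 1 = "replaced", but only if the sample actually differs
            labels.append(int(replacement != tok))
        else:
            corrupted.append(tok)
            labels.append(0)  # 0 = "original"
    return corrupted, labels

tokens = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, labels = make_rtd_example(tokens)
```

The discriminator is trained on every position (not just the masked ones), which is why ELECTRA is sample-efficient on small corpora like BabyLM's 10M tokens.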
- Framework: PyTorch
- Library: Hugging Face Transformers
- Hardware: NVIDIA L40 GPU (HPC Cluster)
- Tokenizer: ElectraTokenizerFast
- Model: ElectraForMaskedLM
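As a sketch of how these components fit together (assuming `transformers` and `torch` are installed; the config sizes below are illustrative, not the hyperparameters used in this repo):

```python
import torch
from transformers import ElectraConfig, ElectraForMaskedLM

# Illustrative small config; the real hyperparameters live in pretraining.py.
config = ElectraConfig(
    vocab_size=30522,
    hidden_size=128,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=256,
)
model = ElectraForMaskedLM(config)  # randomly initialized: trained from scratch

# The saved tokenizer would normally be loaded from electra_tokenizer/, e.g.:
# from transformers import ElectraTokenizerFast
# tokenizer = ElectraTokenizerFast.from_pretrained("electra_tokenizer")

input_ids = torch.randint(0, config.vocab_size, (1, 16))
logits = model(input_ids=input_ids).logits  # shape: (1, 16, vocab_size)
```

Building the model from an `ElectraConfig` (rather than `from_pretrained`) is what makes this pre-training from scratch: no published checkpoint is downloaded.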
- `pretraining.py`: Python script for batch training on HPC clusters.
- `pretraining.sh`: Slurm submission script for requesting GPU resources (e.g., on the Borah cluster).
- `pretraining.ipynb`: Interactive development and visualization notebook.
- `electra_tokenizer/`: Saved tokenizer configuration and vocabulary.
- `data/`: (Not uploaded) Directory containing the training, dev, and test sets (BabyLM 10M).
1. Clone the repository:

   ```bash
   git clone https://github.com/YOUR_USERNAME/your-repo-name.git
   cd your-repo-name
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Prepare data: place your `.train`, `.dev`, and `.test` files inside the `data/` folder.

4. Run training:

   ```bash
   python pretraining.py
   ```

   To run on a cluster using Slurm, use the provided submission script:

   ```bash
   sbatch pretraining.sh
   ```

   Check the `logs/` directory for output and error files.
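A minimal Slurm submission script for a setup like this might look as follows (a hedged sketch: the partition name, time limit, and resource requests are assumptions, not the actual contents of this repository's `pretraining.sh`):

```shell
#!/bin/bash
#SBATCH --job-name=electra-pretrain
#SBATCH --gres=gpu:1             # request one GPU (e.g., an L40)
#SBATCH --partition=gpu          # partition name is cluster-specific
#SBATCH --time=24:00:00
#SBATCH --output=logs/%x-%j.out  # Slurm writes stdout here
#SBATCH --error=logs/%x-%j.err   # and stderr here

mkdir -p logs
python pretraining.py
```

Slurm expands `%x` to the job name and `%j` to the job ID, which is why the output and error files land in `logs/` with unique names per run.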