
BabyLM ELECTRA Pre-training 🚀

This repository contains the implementation for pre-training an ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) model from scratch using the BabyLM dataset.
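The acronym spells out the objective: rather than relying on masked language modeling alone, ELECTRA trains a discriminator to classify, for every token position, whether a small generator network replaced the original token. A minimal illustration of the discriminator's binary targets (function name and ids are hypothetical, for exposition only):

```python
def rtd_labels(original_ids, corrupted_ids):
    """ELECTRA replaced-token-detection targets: 1 where the generator
    swapped in a different token, 0 where the original survived."""
    return [int(o != c) for o, c in zip(original_ids, corrupted_ids)]

# Positions 1 and 3 were replaced by the generator:
rtd_labels([5, 8, 2, 9], [5, 30, 2, 41])  # → [0, 1, 0, 1]
```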

📌 Project Overview

This project was developed for my NLP class. The goal was to explore language acquisition by training a transformer model from scratch on a limited dataset of roughly 10M tokens, mirroring the constraints of the BabyLM Challenge.

🛠️ Tech Stack

  • Framework: PyTorch
  • Library: Hugging Face Transformers
  • Hardware: NVIDIA L40 GPU (HPC Cluster)
  • Tokenizer: ElectraTokenizerFast
  • Model: ElectraForMaskedLM
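Since the model class above is ElectraForMaskedLM (the masked-language-modeling head), pre-training revolves around the standard BERT-style masking recipe: select ~15% of positions, replace 80% of those with [MASK], 10% with a random token, and keep 10% unchanged. A minimal plain-Python sketch of that recipe — MASK_ID and VOCAB_SIZE are placeholder values, not read from this repo's tokenizer:

```python
import random

MASK_ID = 103       # assumption: [MASK] id in a BERT/ELECTRA-style vocab
VOCAB_SIZE = 30522  # assumption: standard WordPiece vocab size

def mask_tokens(input_ids, mlm_prob=0.15, seed=None):
    """Apply the conventional 15% / 80-10-10 MLM masking.
    Labels hold the original id at masked positions, -100 elsewhere
    (the value PyTorch cross-entropy ignores)."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)
    for i, tok in enumerate(input_ids):
        if rng.random() < mlm_prob:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                masked[i] = MASK_ID            # 80%: [MASK]
            elif r < 0.9:
                masked[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # else: 10% keep the original token
    return masked, labels
```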

📂 Project Structure

  • pretraining.py: Python script for batch training on HPC clusters.
  • pretraining.sh: Slurm submission script for requesting GPU resources (e.g., on Borah cluster).
  • pretraining.ipynb: Interactive development and visualization notebook.
  • electra_tokenizer/: Saved tokenizer configuration and vocabulary.
  • data/: (Not uploaded) Directory containing training, dev, and test sets (BabyLM 10M).

🚀 Getting Started

Local Development

  1. Clone the repository:

    git clone https://github.com/YOUR_USERNAME/your-repo-name.git
    cd your-repo-name
  2. Install dependencies:

    pip install -r requirements.txt
  3. Prepare Data: Place your .train, .dev, and .test files inside the data/ folder.

  4. Run Training:

    python pretraining.py
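pretraining.py itself is not reproduced here, but the data-preparation step it performs typically amounts to tokenizing the raw .train text and packing the resulting ids into fixed-length training examples. A minimal sketch of that packing step (the block size of 128 is an assumption, not this repo's actual setting):

```python
def chunk_token_ids(token_ids, block_size=128):
    """Pack a long stream of token ids into fixed-length training
    examples, dropping the final incomplete block (a common choice
    in MLM pre-training pipelines)."""
    usable = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]
```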

HPC Cluster (Slurm/Borah)

To run on a cluster using Slurm, use the provided submission script:

sbatch pretraining.sh

Check the logs/ directory for output and error files.
