This repository contains the implementation for pre-training an ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) model from scratch using the BabyLM dataset.
This project was part of my NLP class; the goal was to explore language acquisition by training a transformer model on a limited dataset (10M tokens), mirroring the constraints of the BabyLM Challenge.
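ELECTRA's pre-training objective, replaced-token detection (RTD), can be illustrated with a toy sketch (this is illustrative only, not this repository's code): a small generator proposes replacements at some positions, and the discriminator must label every token as original or replaced. Here a random sampler stands in for the generator.

```python
import random

def make_rtd_example(tokens, mask_prob=0.3, vocab=None, seed=0):
    """Toy replaced-token-detection (RTD) labeling, as in ELECTRA.

    A real ELECTRA generator is a small masked LM; here we stand in
    for it with random sampling from a tiny vocabulary (hypothetical,
    for illustration only).
    """
    rng = random.Random(seed)
    vocab = vocab or ["the", "cat", "sat", "on", "a", "mat", "dog", "ran"]
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            replacement = rng.choice(vocab)
            corrupted.append(replacement)
            # label 1 = "replaced", but only if the sample actually differs
            labels.append(int(replacement != tok))
        else:
            corrupted.append(tok)
            labels.append(0)  # 0 = "original"
    return corrupted, labels

tokens = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, labels = make_rtd_example(tokens)
```

The discriminator is trained on every position (not just the masked ones), which is why ELECTRA is sample-efficient on small corpora like BabyLM's 10M tokens.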
- Framework: PyTorch
- Library: Hugging Face Transformers
- Hardware: NVIDIA L40 GPU (HPC Cluster)
- Tokenizer: ElectraTokenizerFast
- Model: ElectraForMaskedLM
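As a sketch of how these components fit together (assuming `transformers` and `torch` are installed; the config sizes below are illustrative, not the hyperparameters used in this repo):

```python
import torch
from transformers import ElectraConfig, ElectraForMaskedLM

# Illustrative small config; the real hyperparameters live in pretraining.py.
config = ElectraConfig(
    vocab_size=30522,
    hidden_size=128,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=256,
)
model = ElectraForMaskedLM(config)  # randomly initialized: trained from scratch

# The saved tokenizer would normally be loaded from electra_tokenizer/, e.g.:
# from transformers import ElectraTokenizerFast
# tokenizer = ElectraTokenizerFast.from_pretrained("electra_tokenizer")

input_ids = torch.randint(0, config.vocab_size, (1, 16))
logits = model(input_ids=input_ids).logits  # shape: (1, 16, vocab_size)
```

Building the model from an `ElectraConfig` (rather than `from_pretrained`) is what makes this pre-training from scratch: no published checkpoint is downloaded.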
- `pretraining.py`: Python script for batch training on HPC clusters.
- `pretraining.sh`: Slurm submission script for requesting GPU resources (e.g., on the Borah cluster).
- `pretraining.ipynb`: Interactive development and visualization notebook.
- `electra_tokenizer/`: Saved tokenizer configuration and vocabulary.
- `data/`: (Not uploaded) Directory containing the training, dev, and test sets (BabyLM 10M).
1. Clone the repository:

   ```bash
   git clone https://github.com/YOUR_USERNAME/your-repo-name.git
   cd your-repo-name
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Prepare data: place your `.train`, `.dev`, and `.test` files inside the `data/` folder.

4. Run training:

   ```bash
   python pretraining.py
   ```

   To run on a cluster using Slurm, use the provided submission script:

   ```bash
   sbatch pretraining.sh
   ```

   Check the `logs/` directory for output and error files.
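A minimal Slurm submission script for a setup like this might look as follows (a hedged sketch: the partition name, time limit, and resource requests are assumptions, not the actual contents of this repository's `pretraining.sh`):

```shell
#!/bin/bash
#SBATCH --job-name=electra-pretrain
#SBATCH --gres=gpu:1             # request one GPU (e.g., an L40)
#SBATCH --partition=gpu          # partition name is cluster-specific
#SBATCH --time=24:00:00
#SBATCH --output=logs/%x-%j.out  # Slurm writes stdout here
#SBATCH --error=logs/%x-%j.err   # and stderr here

mkdir -p logs
python pretraining.py
```

Slurm expands `%x` to the job name and `%j` to the job ID, which is why the output and error files land in `logs/` with unique names per run.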