cronenberg64/SciBERT-CTFT

Scientific Abstract Classifier using a SetFit and Contrastive Learning Fine-tuned Language Model

This repository was created for the final project of the Artificial Intelligence class. It also includes past experiments that led to the final model but were ultimately unused or irrelevant to the final submission; they are kept purely for reference and for my own documentation and tracking.

View Final Report (PDF)

Results

The model was trained twice to check reproducibility, and both runs produced very similar performance metrics.

Run 1 - Confusion Matrix

Shows the classification accuracy across all three scientific domains (Bioinformatics, Neuroscience, Materials Science).

Confusion Matrix - Run 1

Run 1 - Class Performance Metrics

Comparison of Precision, Recall, and F1-Score for each scientific domain.

Class Performance - Run 1

Run 1 - t-SNE Embedding Visualization

2D visualization of how the model separates the three scientific domains in its learned embedding space.

t-SNE Clusters - Run 1


Run 2 - Confusion Matrix

Second training run showing consistent classification performance.

Confusion Matrix - Run 2

Run 2 - Class Performance Metrics

Second run metrics confirming reproducible results.

Class Performance - Run 2

Run 2 - t-SNE Embedding Visualization

Second run showing similar cluster separation patterns.

t-SNE Clusters - Run 2

Note: The consistency between both runs demonstrates the robustness of the SetFit contrastive learning approach and validates the reliability of the model's performance metrics.
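The per-class metrics and confusion matrices shown in the figures above can be produced with scikit-learn. A minimal sketch, using toy labels and predictions as stand-ins for the actual run outputs:

```python
# Sketch of the evaluation step: confusion matrix plus per-class
# precision/recall/F1. The labels below are toy examples, not real results.
from sklearn.metrics import confusion_matrix, classification_report

DOMAINS = ["Bioinformatics", "Neuroscience", "Materials Science"]

# Toy ground-truth and predicted domain labels for six abstracts.
y_true = ["Bioinformatics", "Neuroscience", "Materials Science",
          "Bioinformatics", "Neuroscience", "Materials Science"]
y_pred = ["Bioinformatics", "Neuroscience", "Materials Science",
          "Bioinformatics", "Materials Science", "Materials Science"]

# Rows of cm follow the DOMAINS order: row = true class, column = predicted.
cm = confusion_matrix(y_true, y_pred, labels=DOMAINS)
report = classification_report(y_true, y_pred, labels=DOMAINS)
print(cm)
print(report)
```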


Model Comparisons

SetFit with Different Base Models

Comparison of SciBERT, BERT-Large, and BERT-base using the SetFit contrastive learning framework.

SetFit Model Comparison

SciBERT (99.52%) significantly outperforms BERT-Large (98.94%) and BERT-base (93.75%), demonstrating the value of domain-specific pre-training.

SetFit vs Standard Fine-tuning

Comparison of standard fine-tuning approaches across different transformer models.

Fine-tuning Comparison

The SetFit contrastive learning approach produces tighter cluster separations compared to traditional fine-tuning methods.
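The contrastive stage is what distinguishes SetFit from standard fine-tuning: the sentence encoder is first tuned on text pairs built from the labeled examples, where same-label pairs are positives and cross-label pairs are negatives. A simplified sketch of that pair generation (a hypothetical helper, not the actual setfit internals):

```python
from itertools import combinations

def make_contrastive_pairs(texts, labels):
    """Build (text_a, text_b, similarity) triples: 1.0 for same-label
    pairs, 0.0 for cross-label pairs, as in SetFit's contrastive stage."""
    pairs = []
    for i, j in combinations(range(len(texts)), 2):
        sim = 1.0 if labels[i] == labels[j] else 0.0
        pairs.append((texts[i], texts[j], sim))
    return pairs

# Toy abstracts from the three domains.
texts = ["protein folding study", "gene expression atlas",
         "cortical spike trains", "perovskite thin films"]
labels = ["bio", "bio", "neuro", "materials"]
pairs = make_contrastive_pairs(texts, labels)
```

The encoder trained on these pairs pulls same-domain abstracts together in embedding space, which is what produces the tight t-SNE clusters above.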

Quick Start (Running the Classifier)

If you only want to run the final classifier verification and training:

  1. Install Requirements:

    pip install -r requirements.txt
  2. Run the Main Script:

    python JonathanSetiawan_aifinal.py

This trains the SetFit model on the augmented dataset, evaluates it, and writes the result figures to results/figures/, provided the dataset is in the expected location.


Dataset Creation (Step-by-Step)

If you wish to reproduce the entire dataset creation process from scratch, follow these steps:

0. Collect Raw Data from arXiv (Optional)

If you need to fetch fresh abstracts from the arXiv API, run the data collection script. This queries arXiv for scientific abstracts across three domains: Bioinformatics, Neuroscience, and Materials Science.

python archive/src/collect_data.py

Output: data/raw/scientific_abstracts_dataset.csv (~1,037 abstracts)

Note: The raw dataset is already included in the repository. Only run this step if you want to collect a fresh set of abstracts from arXiv.
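The collection step boils down to querying the arXiv API per domain. A minimal sketch of building such a request URL; the category mapping here is an assumption for illustration, and collect_data.py may use different queries:

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

# Hypothetical domain-to-category mapping (assumed, not taken from the repo).
DOMAIN_QUERIES = {
    "Bioinformatics": "cat:q-bio.GN",
    "Neuroscience": "cat:q-bio.NC",
    "Materials Science": "cat:cond-mat.mtrl-sci",
}

def build_query_url(domain, max_results=350, start=0):
    """Build an arXiv API request URL for one domain's abstracts."""
    params = {
        "search_query": DOMAIN_QUERIES[domain],
        "start": start,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"

url = build_query_url("Neuroscience")
```

Fetching the URL returns an Atom feed whose entries contain the abstracts; roughly 350 per domain yields the ~1,037 total noted above.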

1. Build the Associativity Map

This script analyzes the raw scientific abstracts to find domain-specific keywords and synonyms using TF-IDF.

python preprocessing/build_associativity.py

Output: data/associativity_map.json

2. Generate Augmented Dataset

This script uses the map to inject context hints into the raw abstracts, creating the final dataset.

python preprocessing/generate_augmented_data.py

Output: data/processed/phase5_augmented_dataset.csv (Note: The main script expects augmented_dataset.csv, so you may need to rename it or update the script if you regenerate it).
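Context injection itself can be as simple as prepending a few of the domain's mapped keywords to each abstract. A minimal sketch under that assumption; the real generate_augmented_data.py may place the hints differently:

```python
# Hypothetical context-hint injection: prepend n_hints keywords from the
# associativity map to the abstract text.
def inject_context(abstract, keywords, n_hints=3):
    hint = " ".join(keywords[:n_hints])
    return f"{hint}. {abstract}"

keywords = ["synapse", "cortex", "neuron"]
augmented = inject_context("We record spiking activity in mouse V1.", keywords)
```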


Project Structure

  • JonathanSetiawan_aifinal.py: Main execution script for training and evaluation.
  • data/: Contains raw and processed datasets.
  • preprocessing/: Scripts used to create and augment the dataset.
  • archive/: Contains previous experimental code and auxiliary files.
  • models/: Stores the trained final model.
  • results/figures/: Contains sample output figures from a previous run. These will be overwritten when you run the main script.

Requirements

Ensure you have the dependencies listed in requirements.txt installed. Key libraries include:

  • setfit
  • pandas
  • scikit-learn
  • matplotlib
  • seaborn
  • nltk

About

SciBERT-based scientific abstract classification using SetFit framework with context injection data augmentation
