This repository contains the final project for the Artificial Intelligence class. It also includes past experiments that led to the final model but were ultimately unused or are not relevant to the final submission; these are kept purely for reference and for my personal documentation/tracking.
The model was trained twice to check for run-to-run variance in error, and both runs produced very similar performance metrics.
Classification accuracy across all three scientific domains (Bioinformatics, Neuroscience, and Materials Science).
Comparison of Precision, Recall, and F1-Score for each scientific domain.
2D visualization of how the model separates the three scientific domains in its learned embedding space.
Second training run showing consistent classification performance.
Second run metrics confirming reproducible results.
Second run showing similar cluster separation patterns.
Note: The consistency between both runs demonstrates the robustness of the SetFit contrastive learning approach and validates the reliability of the model's performance metrics.
Comparison of SciBERT, BERT-Large, and BERT-base using the SetFit contrastive learning framework.
SciBERT (99.52%) significantly outperforms BERT-Large (98.94%) and BERT-base (93.75%), demonstrating the value of domain-specific pre-training.
Comparison of standard fine-tuning approaches across different transformer models.
The SetFit contrastive learning approach produces tighter cluster separations compared to traditional fine-tuning methods.
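SetFit's contrastive first stage fine-tunes a sentence transformer on pairs built from the few labeled examples: same-label pairs become positives, cross-label pairs become negatives. A minimal sketch of that pair generation, with toy abstracts and a `generate_pairs` helper that are illustrative, not taken from the project code:

```python
from itertools import combinations
import random

def generate_pairs(texts, labels, seed=0):
    """Build (text_a, text_b, is_same_class) training pairs, as in
    SetFit's contrastive stage: 1 = same domain, 0 = different domain."""
    rng = random.Random(seed)
    pairs = [
        (ta, tb, 1 if la == lb else 0)
        for (ta, la), (tb, lb) in combinations(zip(texts, labels), 2)
    ]
    rng.shuffle(pairs)
    return pairs

abstracts = ["protein folding study", "gene expression atlas",
             "cortical neuron spiking", "perovskite band gap"]
domains   = ["bio", "bio", "neuro", "materials"]

# 4 texts -> C(4,2) = 6 pairs; only the two "bio" texts form a positive pair
pairs = generate_pairs(abstracts, domains)
```

The classification head is then trained on the fine-tuned embeddings; the tighter clusters in the embedding plots come from this pairwise objective.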
If you only want to run the final classifier verification and training:

1. Install the requirements:

```
pip install -r requirements.txt
```

2. Run the main script:

```
python JonathanSetiawan_aifinal.py
```

This will train the SetFit model on the augmented dataset, evaluate it, and generate results in `results/figures/`, assuming the dataset is located correctly.
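The per-domain Precision/Recall/F1 figures can be derived from model predictions with scikit-learn's `classification_report`; a minimal sketch (the labels and predictions below are toy placeholders, not the project's actual outputs):

```python
from sklearn.metrics import classification_report

domains = ["Bioinformatics", "Neuroscience", "Materials Science"]
y_true = ["Bioinformatics", "Neuroscience", "Materials Science",
          "Bioinformatics", "Neuroscience", "Materials Science"]
y_pred = ["Bioinformatics", "Neuroscience", "Materials Science",
          "Bioinformatics", "Materials Science", "Materials Science"]

# output_dict=True returns per-class precision/recall/f1, handy for plotting
report = classification_report(y_true, y_pred, labels=domains,
                               output_dict=True, zero_division=0)
for d in domains:
    print(d, report[d]["precision"], report[d]["recall"], report[d]["f1-score"])
```

The per-class entries of `report` can be fed directly into a grouped bar chart like the ones in `results/figures/`.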
If you wish to reproduce the entire dataset creation process from scratch, follow these steps:
If you need to fetch fresh abstracts from the arXiv API, run the data collection script. This queries arXiv for scientific abstracts across three domains: Bioinformatics, Neuroscience, and Materials Science.
```
python archive/src/collect_data.py
```

Output: `data/raw/scientific_abstracts_dataset.csv` (~1,037 abstracts)

Note: The raw dataset is already included in the repository. Only run this step if you want to collect a fresh set of abstracts from arXiv.
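For reference, arXiv exposes a public Atom API at `http://export.arxiv.org/api/query`, and a category query URL can be built like this. The category code and parameter values below are illustrative; the exact categories queried by `collect_data.py` are an assumption, and the script must still fetch and parse the Atom XML response:

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

def build_arxiv_query(category, start=0, max_results=100):
    """Build an arXiv API query URL for abstracts in one category
    (e.g. q-bio.GN for genomics); pagination uses start/max_results."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"

url = build_arxiv_query("q-bio.GN", max_results=50)
```

Paging with `start` in steps of `max_results` is how a corpus of ~1,000 abstracts would be accumulated across the three domains.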
This script analyzes the raw scientific abstracts to find domain-specific keywords and synonyms using TF-IDF.
```
python preprocessing/build_associativity.py
```

Output: `data/associativity_map.json`
This script uses the map to inject context hints into the raw abstracts, creating the final dataset.
```
python preprocessing/generate_augmented_data.py
```

Output: `data/processed/phase5_augmented_dataset.csv`

Note: The main script expects `augmented_dataset.csv`, so you may need to rename the file or update the script if you regenerate it.
- `JonathanSetiawan_aifinal.py`: Main execution script for training and evaluation.
- `data/`: Contains raw and processed datasets.
- `preprocessing/`: Scripts used to create and augment the dataset.
- `archive/`: Contains previous experimental code and auxiliary files.
- `models/`: Stores the trained final model.
- `results/figures/`: Contains sample output figures from a previous run. These will be overwritten when you run the main script.
Ensure you have the dependencies listed in requirements.txt installed. Key libraries include:
`setfit`, `pandas`, `scikit-learn`, `matplotlib`, `seaborn`, `nltk`







