A clean, reproducible Self-Supervised Learning (SSL) project that demonstrates SimCLR pretraining on the STL-10 unlabeled split (100k images) and evaluates the learned representations with standard protocols:
- kNN@K on frozen embeddings
- Linear probe (frozen encoder + linear classifier)
- UMAP visualization of embedding space
- Nearest-neighbor retrieval in embedding space (cosine similarity)
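The kNN protocol can be sketched compactly. The following is an illustrative NumPy version (not the repo's `src/eval_knn.py`): embeddings are L2-normalized so cosine similarity reduces to a dot product, and each test point is classified by majority vote over its k nearest training embeddings.

```python
import numpy as np

def knn_predict(train_emb, train_labels, test_emb, k=20):
    """Classify test embeddings by majority vote over the k most
    cosine-similar training embeddings (frozen encoder, no training)."""
    # L2-normalize so the dot product equals cosine similarity
    train_emb = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    test_emb = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    sim = test_emb @ train_emb.T                  # (n_test, n_train)
    topk = np.argsort(-sim, axis=1)[:, :k]        # indices of k nearest
    votes = train_labels[topk]                    # (n_test, k) label votes
    return np.array([np.bincount(v).argmax() for v in votes])
```

Because the encoder is frozen, this protocol measures how linearly clustered the learned embedding space already is, with no learned classifier at all.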
This repo is designed to be portfolio-ready:
- runs on a single GPU (e.g., RTX 2070),
- produces structured artifacts (logs / checkpoints / metrics),
- keeps training in scripts and analysis in notebooks.
Strong SimCLR run: simclr_version_4 (50 epochs)
- kNN@20 accuracy: 0.7405
- Linear-probe accuracy (20 epochs): 0.7360
All results are reproducible from artifacts in `artifacts/` and summarized in:
- `artifacts/metrics/runs_index.csv`
- `artifacts/metrics/summary.csv`
- Train SimCLR on STL-10 unlabeled using `src/train_ssl.py`
- Logs go to `artifacts/logs/…`
- Checkpoints go to `artifacts/checkpoints/…`
- Run registry + aggregated metrics go to `artifacts/metrics/…`
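For context, SimCLR's pretraining objective (the NT-Xent contrastive loss) can be sketched in NumPy. This is an illustration of the loss itself, not the implementation in `src/losses/`; it assumes a batch of 2N embeddings where rows i and i+N are two augmented views of the same image.

```python
import numpy as np

def nt_xent(z, temperature=0.5):
    """NT-Xent (normalized temperature-scaled cross entropy) over 2N
    embeddings, where rows i and i+N are the two views of sample i.
    Numerically naive softmax; fine for a small sketch."""
    n2 = z.shape[0]
    n = n2 // 2
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / temperature                        # (2N, 2N) logits
    np.fill_diagonal(sim, -np.inf)                     # mask self-similarity
    # index of each row's positive pair: i <-> i+N
    pos = np.concatenate([np.arange(n, n2), np.arange(0, n)])
    logprob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -logprob[np.arange(n2), pos].mean()
```

The loss is minimized when each pair of views is more similar to each other than to every other embedding in the batch, which is what pushes the encoder toward augmentation-invariant features.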
- `src/eval_knn.py` — kNN on frozen embeddings (STL-10 train → test)
- `src/eval_linear.py` — linear probe on frozen embeddings
- `01_augmentations_preview.ipynb` — why augmentations matter in SSL
- `02_experiments_report_fixed.ipynb` — training curves / loss analysis
- `03_umap_embeddings_fixed.ipynb` — compute embeddings + UMAP visualization
- `04_retrieval_demo.ipynb` — nearest-neighbor retrieval + Hit@10 sanity check
- `05_ssl_final_simclr.ipynb` — final showcase (all key results in one notebook)
Option A: conda (recommended)

```bash
conda env create -f environment.yml
conda activate ssl_env
```

Option B: pip

```bash
python -m venv .venv
# Windows (PowerShell):
.\.venv\Scripts\Activate.ps1
# Windows (cmd):
.venv\Scripts\activate.bat
# Git Bash:
source .venv/Scripts/activate
pip install -r requirements.txt
```

Train SimCLR on STL-10 unlabeled:

```bash
python -m src.train_ssl --config configs/simclr_r18_stl10_strong.yaml
```

This will create:
- `artifacts/logs/simclr/version_X/metrics.csv`
- `artifacts/checkpoints/simclr/simclr_version_X/{last.ckpt,best.ckpt}`
- updated `artifacts/metrics/runs_index.csv` and `artifacts/metrics/summary.csv`
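The run registry is just a CSV that grows one row per training run. A minimal sketch of how such a registry can be appended to (the column names here are hypothetical; the repo's actual schema may differ):

```python
import csv
from pathlib import Path

def register_run(index_path, run_id, config, epochs):
    """Append one row to a run-registry CSV, writing the header only when
    the file is first created. (Illustrative sketch, not the repo's code.)"""
    path = Path(index_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["run_id", "config", "epochs"])
        if is_new:
            writer.writeheader()
        writer.writerow({"run_id": run_id, "config": config, "epochs": epochs})
```

Keeping the registry append-only means every historical run stays queryable from a single file, which is what makes the summary CSVs reproducible.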
kNN@20:

```bash
python -m src.eval_knn --project-root . --k 20 --use best
```

Linear probe (20 epochs):

```bash
python -m src.eval_linear --project-root . --epochs 20 --use best
```

After running these scripts, open `artifacts/metrics/summary.csv` (updated with `knn_acc` and `linear_acc`).
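Conceptually, the linear probe trains only a softmax classifier on top of frozen embeddings; the encoder itself is never updated. A self-contained NumPy sketch of that idea (not the repo's `src/eval_linear.py`, which works on real STL-10 features):

```python
import numpy as np

def linear_probe(emb, labels, epochs=200, lr=0.5):
    """Fit a linear softmax classifier on frozen embeddings with
    full-batch gradient descent on the cross-entropy loss."""
    n, d = emb.shape
    c = labels.max() + 1
    W = np.zeros((d, c))
    onehot = np.eye(c)[labels]
    for _ in range(epochs):
        logits = emb @ W
        logits -= logits.max(axis=1, keepdims=True)  # numeric stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * emb.T @ (p - onehot) / n           # cross-entropy gradient
    return W

def probe_accuracy(W, emb, labels):
    return float(((emb @ W).argmax(axis=1) == labels).mean())
```

Because only the linear layer is trained, probe accuracy measures how linearly separable the classes already are in the frozen SSL embedding space.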
Run Jupyter and open:
notebooks/05_ssl_final_simclr.ipynb
This notebook reproduces:
- training curves (loss),
- kNN + linear-probe metrics,
- UMAP embeddings,
- retrieval demo + Hit@10 sanity metric.
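The Hit@10 sanity metric counts, for each query image, whether any of its 10 nearest neighbors (by cosine similarity, excluding the query itself) shares the query's class. A minimal NumPy sketch of that idea (not the notebook's actual code):

```python
import numpy as np

def hit_at_k(emb, labels, k=10):
    """Fraction of queries for which at least one of the k nearest
    cosine neighbors (self excluded) has the same label."""
    z = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = z @ z.T
    np.fill_diagonal(sim, -np.inf)                 # exclude self-match
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = (labels[topk] == labels[:, None]).any(axis=1)
    return float(hits.mean())
```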
- Paths are handled relative to the project root (`PROJECT_ROOT` in notebooks).
- The repo stores:
  - checkpoints (`best.ckpt`, `last.ckpt`)
  - metrics logs (`metrics.csv`)
  - run registry (`runs_index.csv`)
  - aggregated summary (`summary.csv`)
- Notebooks are analysis-only: they do not train models.
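One common way to resolve the project root from inside a notebook is to walk upward from the working directory until a marker directory is found. This helper is hypothetical (the repo's notebooks may set `PROJECT_ROOT` differently), using `artifacts/` as the marker:

```python
from pathlib import Path

def find_project_root(start=None):
    """Walk upward from `start` (default: cwd) until a directory that
    contains an `artifacts/` folder is found. Hypothetical helper."""
    p = Path(start or Path.cwd()).resolve()
    for candidate in [p, *p.parents]:
        if (candidate / "artifacts").is_dir():
            return candidate
    raise FileNotFoundError("no ancestor directory contains artifacts/")
```

This keeps notebook paths stable no matter whether Jupyter is launched from the repo root or from `notebooks/`.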
```text
├── artifacts/
│   ├── checkpoints/
│   ├── embeddings/
│   ├── figures/
│   ├── logs/
│   └── metrics/
├── configs/
├── data/
├── lightning_logs/   # optional (legacy Lightning dir if used)
├── notebooks/
└── src/
    ├── data/
    ├── losses/
    ├── models/
    └── utils/
```
- Add BYOL (non-contrastive SSL) and compare side-by-side with the same metrics (kNN + linear probe + UMAP + retrieval).
- Add FAISS indexing for large-scale retrieval (engineering upgrade; not required for this portfolio version).
- SimCLR: Chen et al., 2020 — A Simple Framework for Contrastive Learning of Visual Representations
- STL-10 dataset: Coates et al., 2011
- PyTorch Lightning for clean training loops
MIT License. See LICENSE for details.