Dual TF-IDF (word + char) → Linear SVM (calibrated) → nested CV + randomized search → threshold policy → explainability & robustness checks.
Full case study: CASE_STUDY.md
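The pipeline summarized above can be sketched with scikit-learn. This is a minimal illustration, not the tuned configuration from the notebook: the n-gram ranges, `C`, and the toy texts are placeholder values.

```python
# Sketch: dual TF-IDF (word + char) features feeding a calibrated Linear SVM.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))),
])

pipeline = Pipeline([
    ("tfidf", features),
    # Platt scaling (sigmoid) wraps the margin-based SVM so that
    # predict_proba is available for threshold-based decisions.
    ("clf", CalibratedClassifierCV(LinearSVC(C=1.0), method="sigmoid", cv=2)),
])

# Tiny illustrative corpus (1 = spam, 0 = ham).
texts = [
    "win a free prize now", "free entry, claim your cash prize",
    "urgent! call now to win", "see you at 6 tonight",
    "ok, lunch tomorrow then", "can you pick up milk?",
]
labels = [1, 1, 1, 0, 0, 0]

pipeline.fit(texts, labels)
probs = pipeline.predict_proba(["claim your free prize"])[:, 1]
```

Keeping the vectorizers inside the `Pipeline` is what makes fold-wise fitting (and hence leak-safe evaluation) automatic.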
- Leak-safe evaluation (nested CV)
- Calibrated probabilities for decision-making
- Explicit threshold policy (exported & reused in inference)
- Exported artifacts under `./artifacts/`
- CLI inference via `predict.py`
- Minimal dependencies, CPU-friendly pipeline
- Source: UCI SMS Spam Collection (Kaggle mirror)
- Columns:
  - `v1` — original label (`ham` or `spam`)
  - `v2` — raw SMS text
  - `label` — normalized label (0 = ham, 1 = spam)
  - `text` — cleaned SMS text after preprocessing
- Download the CSV from Kaggle.
- Place it here:
./data/raw/SPAM text message 20170820 - Data.csv
Dataset files are not included in this repository.
Default Kaggle path used in the notebook:
/kaggle/input/spam-text-message-classification/SPAM text message 20170820 - Data.csv
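The load-and-normalize step for these columns can be sketched as follows. The inline `DataFrame` stands in for the CSV above, and the exact cleaning steps in the notebook may differ:

```python
import pandas as pd

# Stand-in for: pd.read_csv("./data/raw/SPAM text message 20170820 - Data.csv")
df = pd.DataFrame({
    "v1": ["ham", "spam", "ham"],
    "v2": ["See you at 6?", "WIN a FREE prize now", "ok, thanks"],
})

# Normalize column names and map labels to 0/1.
df = df.rename(columns={"v1": "label", "v2": "text"})
df["label"] = df["label"].map({"ham": 0, "spam": 1})

# Drop missing rows and duplicate messages, as in the audit step.
df = df.dropna().drop_duplicates(subset="text").reset_index(drop=True)
```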
- Setup & Imports
- Load & Audit (column normalization, missing/duplicates removal)
- EDA — Distributions (class balance, message length histograms)
- Text Normalization (URLs, numbers, emails → placeholders)
- Train/Test Split (stratified)
- Baseline model (Logistic Regression)
- Dual TF-IDF + Linear SVC Pipeline
- Nested CV + randomized search
- Probability calibration (Platt scaling)
- Threshold tuning (F1-optimized by default)
- Evaluation (classification report, PR/ROC curves, calibration plots)
- Explainability (top spam/ham n-grams, FP/FN cases)
- Robustness (obfuscation stress test)
- Artifacts export (model, metrics, metadata)
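The text-normalization step (URLs, numbers, emails → placeholders) can be sketched with plain regexes; the exact patterns and placeholder tokens used in the notebook may differ:

```python
import re

def normalize(text: str) -> str:
    """Lowercase and replace URLs, emails, and digit runs with placeholders."""
    text = text.lower()
    # URLs first, so digits inside them are not split into <num> tokens.
    text = re.sub(r"https?://\S+|www\.\S+", " <url> ", text)
    text = re.sub(r"\S+@\S+\.\S+", " <email> ", text)
    text = re.sub(r"\d+", " <num> ", text)
    return re.sub(r"\s+", " ", text).strip()

normalize("Call 0800123 or visit http://win.example.com now")
# → "call <num> or visit <url> now"
```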
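The nested CV step can be sketched as a `RandomizedSearchCV` (inner folds, hyperparameter tuning) scored by `cross_val_score` over outer folds; fold counts, the search space, and the toy corpus below are illustrative only:

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Vectorizer inside the pipeline → refit on each training fold, no leakage.
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])

inner = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=2, shuffle=True, random_state=1)

search = RandomizedSearchCV(
    pipe, {"clf__C": np.logspace(-2, 2, 10)},
    n_iter=5, cv=inner, scoring="f1", random_state=0,
)

texts = [
    "win a free prize now", "free cash, claim now", "urgent prize call now",
    "win free entry today", "see you at 6", "lunch tomorrow?",
    "ok thanks", "call me when you land",
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

# Outer-fold scores estimate generalization of the whole tuning procedure.
scores = cross_val_score(search, texts, y, cv=outer, scoring="f1")
```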
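One common way to implement F1-optimized threshold tuning is to sweep the precision-recall curve over held-out calibrated probabilities; the arrays below are placeholders for the notebook's validation labels and `predict_proba` output:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Stand-ins for held-out labels and calibrated spam probabilities.
y_val = np.array([0, 0, 1, 1, 0, 1])
probs = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

precision, recall, thresholds = precision_recall_curve(y_val, probs)
# F1 at each candidate threshold; epsilon guards against 0/0.
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
# The final P/R point has no associated threshold, hence f1[:-1].
best_threshold = thresholds[np.argmax(f1[:-1])]
```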
Run-specific metrics and plots are exported under ./artifacts/:
- Metrics: `artifacts/metrics.json`
- Configuration + threshold: `artifacts/metadata.json`
- Python: 3.10–3.12
Install dependencies:

```bash
pip install -r requirements.txt
```

Or clone the repository and install from scratch:

```bash
git clone https://github.com/tarekmasryo/sms-spam-detection
cd sms-spam-detection
pip install -r requirements.txt
```

Place the dataset CSV under:

```
./data/raw/SPAM text message 20170820 - Data.csv
```
Then run:

```bash
jupyter notebook sms-spam-detection.ipynb
```

Run the notebook once to generate artifacts under `./artifacts/`, then:

```bash
python predict.py --text "win a free prize now"
python predict.py "See you at 6?"
```

- No leakage: vectorizers fit only on training folds.
- Nested CV: outer folds provide unbiased performance.
- Calibrated SVC: converts margin scores → reliable probabilities.
- Threshold policy: exported for consistent inference.
- Robustness: includes an obfuscation stress test; TF-IDF typically weakens under heavy adversarial transforms.
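Reusing the exported threshold at inference time can be sketched as below. The metadata schema (a top-level `"threshold"` key) is an assumption for illustration; check the actual `artifacts/metadata.json` produced by the notebook:

```python
import json

# Hypothetical exported metadata; the real file is written by the notebook.
with open("metadata_demo.json", "w") as f:
    json.dump({"threshold": 0.42}, f)

with open("metadata_demo.json") as f:
    threshold = json.load(f)["threshold"]

def classify(prob_spam: float) -> int:
    # Apply the same decision threshold used during evaluation,
    # so offline metrics and live predictions stay consistent.
    return int(prob_spam >= threshold)
```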
MIT (code) — dataset subject to original UCI license.