Breast cancer remains a leading cause of morbidity worldwide. This project applies machine learning techniques to predict breast cancer outcomes using clinical features. Logistic Regression and Gradient Boosting models were trained and evaluated with calibration curves, ROC analysis, and decision curve analysis. Results demonstrate near‑perfect separability between benign and malignant cases, with reliable probability calibration and net clinical benefit. The accompanying Streamlit dashboard provides interactive predictions, dataset upload functionality, and downloadable visualizations, supporting transparency and reproducibility.
This project uses the Breast Cancer Wisconsin (Diagnostic) dataset, originally from the
UCI Machine Learning Repository,
and accessed via Kaggle.
- Features: 30 numeric features computed from digitized images of fine needle aspirates (FNAs).
- Target: Diagnosis (Malignant vs. Benign).
- Instances: 569 samples.
The dataset is widely used for benchmarking classification algorithms in healthcare-related machine learning tasks.
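For quick experiments, the same Breast Cancer Wisconsin (Diagnostic) data also ships with scikit-learn, so it can be loaded without the Kaggle CSV. This is a convenience sketch; the project itself reads the CSV stored under `data/raw/`:

```python
# Load scikit-learn's bundled copy of the Breast Cancer Wisconsin dataset.
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

print(X.shape)            # (569, 30): 569 samples, 30 numeric features
print(data.target_names)  # ['malignant' 'benign']
```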
```
breast-cancer-prediction/
│
├── dashboard/                  # Streamlit dashboard and artifacts
│   ├── artifacts/              # Snapshots of visualizations and tables
│   │   ├── ROC_curve.png
│   │   ├── calibration_curve.png
│   │   └── confusion_matrix.png
│   ├── breast_cancer_dashboard.py
│   ├── breast_cancer_06_dashboard.py
│   ├── X_test.csv
│   └── y_test.csv
│
├── data/                       # Raw and preprocessed datasets
│   ├── raw/
│   │   └── breast_cancer_dataset.csv
│   └── preprocessed/
│       └── breast_cancer_pruned.csv
│
├── models/                     # Serialized models, thresholds, and test sets
│   ├── lr_pipeline.pkl
│   ├── gb_pipeline.pkl
│   ├── threshold_lr.pkl
│   ├── threshold_gb.pkl
│   └── test_set.pkl
│
├── notebooks/                  # Jupyter notebooks for workflow stages
│   ├── breast-cancer-01_download-dataset.ipynb
│   ├── breast-cancer-02-exploratory-data-analysis.ipynb
│   ├── breast-cancer-03-preprocessing.ipynb
│   ├── breast-cancer-04-modeling.ipynb
│   ├── breast-cancer-05-reporting.ipynb
│   └── breast-cancer_06_dashboard.ipynb
│
├── .gitignore                  # Ignore large files and shortcuts
├── README.md                   # Project documentation
└── requirements.txt            # Dependencies
```
The dataset was prepared with the following steps to ensure reproducibility and interpretability:
- **Data Cleaning**: Checked for duplicates and missing values (none present).
- **Feature Scaling**: Standardized numeric features for comparability.
- **Feature Engineering**:
  - Ratios (e.g., `perimeter_radius_ratio`) to highlight proportional relationships
  - Squared terms for non-linear effects
  - Normalized features to emphasize relative variation
  - Interaction terms to capture clinically meaningful feature interactions
- **Pruning**: Applied Variance Inflation Factor (VIF) analysis to reduce collinearity, retaining engineered features that preserve predictive signal.
- **Train/Test Split**: Divided the data into training and testing sets (e.g., an 80/20 split).
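The steps above can be sketched roughly as follows. This is illustrative rather than the project's exact code: it uses scikit-learn's bundled copy of the dataset, a single example ratio feature, and an assumed VIF cutoff of 10.

```python
# Illustrative preprocessing sketch: ratio feature, scaling, VIF check, split.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer(as_frame=True)
X, y = data.data.copy(), data.target

# Feature engineering: a ratio analogous to perimeter_radius_ratio
# (column names follow scikit-learn's copy of the dataset)
X["perimeter_radius_ratio"] = X["mean perimeter"] / X["mean radius"]

# Feature scaling: standardize for comparability
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

# Pruning: VIF_j = 1 / (1 - R^2_j), regressing feature j on the others;
# features with VIF above ~10 (assumed cutoff) are candidates for removal
def vif(df: pd.DataFrame, col: str) -> float:
    others = df.drop(columns=[col])
    r2 = LinearRegression().fit(others, df[col]).score(others, df[col])
    return 1.0 / (1.0 - r2) if r2 < 1.0 else np.inf

high_vif = [c for c in X_scaled.columns if vif(X_scaled, c) > 10]

# Stratified 80/20 train/test split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_scaled, y, test_size=0.2, stratify=y, random_state=42
)
print(f"train: {X_tr.shape}, test: {X_te.shape}, high-VIF: {len(high_vif)}")
```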
To balance interpretability and predictive performance, we applied the following modeling strategies:
- **Algorithms**:
  - Logistic Regression: chosen for transparency and clinical interpretability
  - Gradient Boosting: used to capture complex, non-linear relationships
- **Workflow**:
  - Models trained on the engineered and pruned feature set
  - Hyperparameter tuning performed with cross-validation
  - Evaluation conducted on a held-out test set
- **Interpretability**:
  - Logistic Regression coefficients examined for clinical meaning
  - Gradient Boosting feature importance analyzed to highlight key predictors
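A minimal sketch of this workflow, with assumed hyperparameter grids (the project's actual grids and settings may differ):

```python
# Sketch: both classifiers in pipelines, tuned with cross-validation,
# then scored on a held-out test set. Grids here are illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

lr = Pipeline([("scale", StandardScaler()),
               ("clf", LogisticRegression(max_iter=5000))])
gb = Pipeline([("clf", GradientBoostingClassifier(random_state=42))])

# Cross-validated hyperparameter tuning (assumed example grids)
lr_search = GridSearchCV(lr, {"clf__C": [0.01, 0.1, 1, 10]},
                         cv=5, scoring="roc_auc")
gb_search = GridSearchCV(gb, {"clf__n_estimators": [100, 200]},
                         cv=5, scoring="roc_auc")

lr_search.fit(X_tr, y_tr)
gb_search.fit(X_tr, y_tr)

print(f"LR test AUC: {lr_search.score(X_te, y_te):.3f}")
print(f"GB test AUC: {gb_search.score(X_te, y_te):.3f}")
```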
Model performance was assessed using multiple metrics to balance accuracy with clinical interpretability:
- **ROC Curves**: Compared models on sensitivity vs. specificity trade-offs.
- **AUC (Area Under the Curve)**: Quantified overall discriminative ability.
- **Calibration Plots**: Checked how well predicted probabilities aligned with actual outcomes.
- **Confusion Matrix**: Summarized correct vs. incorrect classifications for malignant and benign cases.
- **Decision Curve Analysis**: Evaluated clinical usefulness by considering net benefit across threshold probabilities.
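These metrics can be reproduced with scikit-learn plus a small helper for decision-curve analysis, whose standard definition of net benefit at threshold probability `pt` is `TP/n - (FP/n) * pt / (1 - pt)`. A hedged sketch using Logistic Regression as the example model:

```python
# Evaluation sketch: AUC, calibration, confusion matrix, and DCA net benefit.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # in sklearn's copy: 1 = benign
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = make_pipeline(StandardScaler(),
                      LogisticRegression(max_iter=5000)).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

# Discrimination: area under the ROC curve
auc = roc_auc_score(y_te, proba)

# Calibration: observed frequency vs. mean predicted probability per bin
frac_pos, mean_pred = calibration_curve(y_te, proba, n_bins=5)

# Confusion matrix at the default 0.5 threshold
tn, fp, fn, tp = confusion_matrix(y_te, proba >= 0.5).ravel()

def net_benefit(y_true, proba, pt):
    """Decision-curve net benefit at threshold probability pt."""
    pred = proba >= pt
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    n = len(y_true)
    return tp / n - fp / n * (pt / (1 - pt))

print(f"AUC: {auc:.3f}, net benefit @0.2: {net_benefit(y_te, proba, 0.2):.3f}")
```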
Interpretability Focus:
- Logistic Regression coefficients were examined for clinical meaning.
- Gradient Boosting feature importance highlighted key predictors influencing malignancy risk.
Clone the repository and install dependencies:

```bash
git clone https://github.com/yasminealiosman/breast-cancer-prediction-project.git
cd breast-cancer-prediction-project
pip install -r requirements.txt
```

Then run the notebooks in order:
- `notebooks/breast-cancer-01_download-dataset.ipynb`
- `notebooks/breast-cancer-02-exploratory-data-analysis.ipynb` → Exploratory analysis (PCA, separability, class balance checks)
- `notebooks/breast-cancer-03-preprocessing.ipynb`
- `notebooks/breast-cancer-04-modeling.ipynb` → Model training, threshold tuning, evaluation
- `notebooks/breast-cancer-05-reporting.ipynb` → Evaluation and reporting
Run the Streamlit app:

```bash
streamlit run dashboard/breast_cancer_06_dashboard.py
```

The dashboard supports:
- Batch scoring: Upload CSVs of patient data (preprocessed format)
- Interactive prediction: Enter single patient features
- Artifacts management: Download ROC curves, confusion matrices, calibration plots, and tuned thresholds
This project is deployed on Streamlit Cloud for easy access and sharing.
🔗 Live Dashboard Preview: Breast Cancer Prediction Dashboard
- Push the repo to GitHub.
- Go to Streamlit Cloud and connect your GitHub repository.
- Select `dashboard/breast_cancer_dashboard.py` as the entry point.
- Streamlit Cloud will automatically install dependencies from `requirements.txt` and launch the app.
- Logistic Regression and Gradient Boosting models
- Tuned thresholds for optimal F1 and clinical balance
- Calibration curves for probability reliability
- Decision curve analysis for net benefit evaluation
- Feature importance plots for interpretability (LR coefficients as risk factors, GB relative importance)
- Import new datasets directly into the dashboard for evaluation or prediction
- Snapshots of visualizations and tables (ROC curves, confusion matrices, calibration plots, DCA results) available for download
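The threshold tuning mentioned above can be sketched as a simple sweep over candidate thresholds, keeping the one that maximizes F1. For brevity this example scores thresholds on the test split; in practice the threshold should be chosen on a validation split to avoid leakage.

```python
# Sketch of F1-based threshold tuning: sweep thresholds over predicted
# probabilities and keep the one with the highest F1 score.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = make_pipeline(StandardScaler(),
                      LogisticRegression(max_iter=5000)).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

# Candidate thresholds from 0.05 to 0.95 in steps of 0.05
thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_te, proba >= t) for t in thresholds]
best = thresholds[int(np.argmax(f1s))]
print(f"best threshold: {best:.2f}, F1: {max(f1s):.3f}")
```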
- Accuracy range (99–100%): These near-perfect scores should be interpreted cautiously; they may partly reflect the strong class separability observed in PCA, the stratified train/test split, and the small test set.
- ROC curves: Show near‑perfect separability between benign and malignant cases.
- Calibration curves: Demonstrate probability reliability — predicted risks align with observed outcomes.
- Decision curve analysis (DCA): Confirms clinical utility by showing net benefit compared to “Treat All” or “Treat None.”
- Interpretability: LR coefficients act as risk factors; GB highlights feature importance consistent with pathology markers.
Yasmine Ali-Osman
- GitHub: @yasminealiosman
- LinkedIn: Yasmine Ali-Osman
This project is licensed under the Creative Commons Attribution–NonCommercial 4.0 International License (CC BY-NC 4.0).
You are free to use, share, and adapt this work with attribution, but commercial use is not permitted.
If you use or adapt this work, please provide proper credit by including:
- Author: Yasmine Ali Osman
- Link to the original repository: GitHub Repo
- License notice: "Licensed under CC BY-NC 4.0 — commercial use is not permitted."
Example citation:
Breast Cancer Prediction Dashboard — Yasmine Ali Osman (CC BY-NC 4.0)