Breast cancer remains a leading cause of morbidity worldwide. This project applies machine learning techniques to predict breast cancer outcomes using clinical features. Logistic Regression and Gradient Boosting models were trained and evaluated with calibration curves, ROC analysis, and decision curve analysis. Results demonstrate near‑perfect separability between benign and malignant cases, with reliable probability calibration and net clinical benefit. The accompanying Streamlit dashboard provides interactive predictions, dataset upload functionality, and downloadable visualizations, supporting transparency and reproducibility.
This project uses the Breast Cancer Wisconsin (Diagnostic) dataset, originally from the
UCI Machine Learning Repository,
and accessed via Kaggle.
- Features: 30 numeric features computed from digitized images of fine needle aspirates (FNAs).
- Target: Diagnosis (Malignant vs. Benign).
- Instances: 569 samples.
The dataset is widely used for benchmarking classification algorithms in healthcare-related machine learning tasks.
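For quick experiments, the same Breast Cancer Wisconsin (Diagnostic) data also ships with scikit-learn, so it can be loaded without the Kaggle CSV. This is a convenience sketch; the project itself reads the CSV stored under `data/raw/`:

```python
# Load scikit-learn's bundled copy of the Breast Cancer Wisconsin dataset.
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

print(X.shape)            # (569, 30): 569 samples, 30 numeric features
print(data.target_names)  # ['malignant' 'benign']
```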
```
breast-cancer-prediction/
│
├── dashboard/                  # Streamlit dashboard and artifacts
│   ├── artifacts/              # Snapshots of visualizations and tables
│   │   ├── ROC_curve.png
│   │   ├── calibration_curve.png
│   │   └── confusion_matrix.png
│   ├── breast_cancer_dashboard.py
│   ├── breast_cancer_06_dashboard.py
│   ├── X_test.csv
│   └── y_test.csv
│
├── data/                       # Raw and preprocessed datasets
│   ├── raw/
│   │   └── breast_cancer_dataset.csv
│   └── preprocessed/
│       └── breast_cancer_pruned.csv
│
├── models/                     # Serialized models, thresholds, and test sets
│   ├── lr_pipeline.pkl
│   ├── gb_pipeline.pkl
│   ├── threshold_lr.pkl
│   ├── threshold_gb.pkl
│   └── test_set.pkl
│
├── notebooks/                  # Jupyter notebooks for workflow stages
│   ├── breast-cancer-01_download-dataset.ipynb
│   ├── breast-cancer-02-exploratory-data-analysis.ipynb
│   ├── breast-cancer-03-preprocessing.ipynb
│   ├── breast-cancer-04-modeling.ipynb
│   ├── breast-cancer-05-reporting.ipynb
│   └── breast-cancer_06_dashboard.ipynb
│
├── .gitignore                  # Ignore large files and shortcuts
├── README.md                   # Project documentation
└── requirements.txt            # Dependencies
```
The dataset was prepared with the following steps to ensure reproducibility and interpretability:
- **Data Cleaning**: Checked for duplicates and missing values (none present).
- **Feature Scaling**: Standardized numeric features for comparability.
- **Feature Engineering**:
  - Ratios (e.g., `perimeter_radius_ratio`) to highlight proportional relationships
  - Squared terms for non-linear effects
  - Normalized features to emphasize relative variation
  - Interaction terms to capture clinically meaningful feature interactions
- **Pruning**: Applied Variance Inflation Factor (VIF) analysis to reduce collinearity, retaining engineered features that preserve predictive signal.
- **Train/Test Split**: Divided the data into training and testing sets (e.g., an 80/20 split).
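The steps above can be sketched roughly as follows. This is illustrative rather than the project's exact code: it uses scikit-learn's bundled copy of the dataset, a single example ratio feature, and an assumed VIF cutoff of 10.

```python
# Illustrative preprocessing sketch: ratio feature, scaling, VIF check, split.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer(as_frame=True)
X, y = data.data.copy(), data.target

# Feature engineering: a ratio analogous to perimeter_radius_ratio
# (column names follow scikit-learn's copy of the dataset)
X["perimeter_radius_ratio"] = X["mean perimeter"] / X["mean radius"]

# Feature scaling: standardize for comparability
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

# Pruning: VIF_j = 1 / (1 - R^2_j), regressing feature j on the others;
# features with VIF above ~10 (assumed cutoff) are candidates for removal
def vif(df: pd.DataFrame, col: str) -> float:
    others = df.drop(columns=[col])
    r2 = LinearRegression().fit(others, df[col]).score(others, df[col])
    return 1.0 / (1.0 - r2) if r2 < 1.0 else np.inf

high_vif = [c for c in X_scaled.columns if vif(X_scaled, c) > 10]

# Stratified 80/20 train/test split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_scaled, y, test_size=0.2, stratify=y, random_state=42
)
print(f"train: {X_tr.shape}, test: {X_te.shape}, high-VIF: {len(high_vif)}")
```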
To balance interpretability and predictive performance, we applied the following modeling strategies:
- **Algorithms**:
  - Logistic Regression: chosen for transparency and clinical interpretability
  - Gradient Boosting: used to capture complex, non-linear relationships
- **Workflow**:
  - Models trained on the engineered and pruned feature set
  - Hyperparameter tuning performed with cross-validation
  - Evaluation conducted on a held-out test set
- **Interpretability**:
  - Logistic Regression coefficients examined for clinical meaning
  - Gradient Boosting feature importance analyzed to highlight key predictors
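A minimal sketch of this workflow, with assumed hyperparameter grids (the project's actual grids and settings may differ):

```python
# Sketch: both classifiers in pipelines, tuned with cross-validation,
# then scored on a held-out test set. Grids here are illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

lr = Pipeline([("scale", StandardScaler()),
               ("clf", LogisticRegression(max_iter=5000))])
gb = Pipeline([("clf", GradientBoostingClassifier(random_state=42))])

# Cross-validated hyperparameter tuning (assumed example grids)
lr_search = GridSearchCV(lr, {"clf__C": [0.01, 0.1, 1, 10]},
                         cv=5, scoring="roc_auc")
gb_search = GridSearchCV(gb, {"clf__n_estimators": [100, 200]},
                         cv=5, scoring="roc_auc")

lr_search.fit(X_tr, y_tr)
gb_search.fit(X_tr, y_tr)

print(f"LR test AUC: {lr_search.score(X_te, y_te):.3f}")
print(f"GB test AUC: {gb_search.score(X_te, y_te):.3f}")
```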
Model performance was assessed using multiple metrics to balance accuracy with clinical interpretability:
- **ROC Curves**: Compared models on sensitivity vs. specificity trade-offs.
- **AUC (Area Under the Curve)**: Quantified overall discriminative ability.
- **Calibration Plots**: Checked how well predicted probabilities aligned with actual outcomes.
- **Confusion Matrix**: Summarized correct vs. incorrect classifications for malignant and benign cases.
- **Decision Curve Analysis**: Evaluated clinical usefulness by considering net benefit across threshold probabilities.
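These metrics can be reproduced with scikit-learn plus a small helper for decision-curve analysis, whose standard definition of net benefit at threshold probability `pt` is `TP/n - (FP/n) * pt / (1 - pt)`. A hedged sketch using Logistic Regression as the example model:

```python
# Evaluation sketch: AUC, calibration, confusion matrix, and DCA net benefit.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # in sklearn's copy: 1 = benign
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = make_pipeline(StandardScaler(),
                      LogisticRegression(max_iter=5000)).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

# Discrimination: area under the ROC curve
auc = roc_auc_score(y_te, proba)

# Calibration: observed frequency vs. mean predicted probability per bin
frac_pos, mean_pred = calibration_curve(y_te, proba, n_bins=5)

# Confusion matrix at the default 0.5 threshold
tn, fp, fn, tp = confusion_matrix(y_te, proba >= 0.5).ravel()

def net_benefit(y_true, proba, pt):
    """Decision-curve net benefit at threshold probability pt."""
    pred = proba >= pt
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    n = len(y_true)
    return tp / n - fp / n * (pt / (1 - pt))

print(f"AUC: {auc:.3f}, net benefit @0.2: {net_benefit(y_te, proba, 0.2):.3f}")
```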
Interpretability Focus:
- Logistic Regression coefficients were examined for clinical meaning.
- Gradient Boosting feature importance highlighted key predictors influencing malignancy risk.
Clone the repository and install dependencies:

```bash
git clone https://github.com/yasminealiosman/breast-cancer-prediction-project.git
cd breast-cancer-prediction-project
pip install -r requirements.txt
```

Then run the notebooks in order:
- `notebooks/breast-cancer-01_download-dataset.ipynb`
- `notebooks/breast-cancer-02-exploratory-data-analysis.ipynb` → Exploratory analysis (PCA, separability, class balance checks)
- `notebooks/breast-cancer-03-preprocessing.ipynb`
- `notebooks/breast-cancer-04-modeling.ipynb` → Model training, threshold tuning, evaluation
- `notebooks/breast-cancer-05-reporting.ipynb` → Evaluation and reporting
Run the Streamlit app:

```bash
streamlit run dashboard/breast_cancer_06_dashboard.py
```

The dashboard supports:
- Batch scoring: Upload CSVs of patient data (preprocessed format)
- Interactive prediction: Enter single patient features
- Artifacts management: Download ROC curves, confusion matrices, calibration plots, and tuned thresholds
This project is deployed on Streamlit Cloud for easy access and sharing.
🔗 Live Dashboard Preview: Breast Cancer Prediction Dashboard
- Push the repo to GitHub.
- Go to Streamlit Cloud and connect your GitHub repository.
- Select `dashboard/breast_cancer_dashboard.py` as the entry point.
- Streamlit Cloud will automatically install dependencies from `requirements.txt` and launch the app.
- Logistic Regression and Gradient Boosting models
- Tuned thresholds for optimal F1 and clinical balance
- Calibration curves for probability reliability
- Decision curve analysis for net benefit evaluation
- Feature importance plots for interpretability (LR coefficients as risk factors, GB relative importance)
- Import new datasets directly into the dashboard for evaluation or prediction
- Snapshots of visualizations and tables (ROC curves, confusion matrices, calibration plots, DCA results) available for download
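The threshold tuning mentioned above can be sketched as a simple sweep over candidate thresholds, keeping the one that maximizes F1. For brevity this example scores thresholds on the test split; in practice the threshold should be chosen on a validation split to avoid leakage.

```python
# Sketch of F1-based threshold tuning: sweep thresholds over predicted
# probabilities and keep the one with the highest F1 score.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = make_pipeline(StandardScaler(),
                      LogisticRegression(max_iter=5000)).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

# Candidate thresholds from 0.05 to 0.95 in steps of 0.05
thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_te, proba >= t) for t in thresholds]
best = thresholds[int(np.argmax(f1s))]
print(f"best threshold: {best:.2f}, F1: {max(f1s):.3f}")
```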
- Accuracy range (99–100%): These near-perfect scores should be interpreted cautiously; they may partly reflect the strong class separability observed in PCA, the stratified train/test split, and the small test set.
- ROC curves: Show near‑perfect separability between benign and malignant cases.
- Calibration curves: Demonstrate probability reliability — predicted risks align with observed outcomes.
- Decision curve analysis (DCA): Confirms clinical utility by showing net benefit compared to “Treat All” or “Treat None.”
- Interpretability: LR coefficients act as risk factors; GB highlights feature importance consistent with pathology markers.
Yasmine Ali-Osman
- GitHub: @yasminealiosman
- LinkedIn: Yasmine Ali-Osman
This project is licensed under the Creative Commons Attribution–NonCommercial 4.0 International License (CC BY-NC 4.0).
You are free to use, share, and adapt this work with attribution, but commercial use is not permitted.
If you use or adapt this work, please provide proper credit by including:
- Author: Yasmine Ali Osman
- Link to the original repository: GitHub Repo
- License notice: "Licensed under CC BY-NC 4.0 — commercial use is not permitted."
Example citation:
Breast Cancer Prediction Dashboard — Yasmine Ali Osman (CC BY-NC 4.0)