Production-minded notebook for binary wellbeing risk (high_risk_flag ∈ {0,1}) derived from digital behavior signals.
Predict a calibrated wellbeing risk score in [0,1] from daily digital patterns — screen time, unlocks, sleep, notifications, activity.
Focus:
- Stable CV, not leaderboard luck
- Minimal, meaningful visuals
- Zero leakage · Deployable artifacts
- Data Loading → clean read, column sanitization, dedup, schema snapshot
- EDA (Focused) → leakage guard · top Spearman correlations · compact Plotly visuals
- Feature Engineering → ratios & interactions · winsorized z-scores · log1p transforms
- Models → Logistic Regression · Random Forest · XGBoost (auto GPU/CPU)
- Calibration → isotonic reliability fit on the best model
- Decisioning → min-cost threshold + capacity constraint (optional review rate)
- Interpretability → AUC-based permutation importance (Kaggle-safe)
- Export → calibrated model + schema + metadata + inference smoke test
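The calibration step above can be sketched with scikit-learn's `CalibratedClassifierCV` using isotonic regression; the feature matrix here is a toy stand-in, not the project's engineered data:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the engineered feature matrix (real columns differ)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Isotonic calibration: cross-validated reliability fit wrapping the base model
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    method="isotonic",
    cv=3,
)
calibrated.fit(X, y)

proba = calibrated.predict_proba(X)[:, 1]  # calibrated risk scores in [0, 1]
```

Isotonic calibration is monotonic and non-parametric, so it preserves the model's ranking (ROC–AUC) while improving the Brier score and reliability curve.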
Key features (examples):
- screen_per_sleep_norm = device_hours_per_day / sleep_hours
- unlocks_per_hour = phone_unlocks / (device_hours_per_day + 0.1)
- social_ratio, work_ratio from time splits
- Winsorized z-scores for device_hours, sleep_hours, stress, anxiety, depression, happiness
- log1p transforms for unlocks, social_media_minutes, work_minutes, notifications_per_day
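A minimal sketch of the feature recipes above (the exact column names in `Data.csv` are assumptions based on the list):

```python
import numpy as np
import pandas as pd

def engineer(df: pd.DataFrame) -> pd.DataFrame:
    """Ratio, winsorized z-score, and log1p features."""
    out = df.copy()
    out["screen_per_sleep_norm"] = out["device_hours_per_day"] / out["sleep_hours"]
    out["unlocks_per_hour"] = out["phone_unlocks"] / (out["device_hours_per_day"] + 0.1)

    # Winsorize at the 1st/99th percentiles, then standardize
    for col in ["device_hours_per_day", "sleep_hours"]:
        lo, hi = out[col].quantile([0.01, 0.99])
        w = out[col].clip(lo, hi)
        out[f"{col}_z"] = (w - w.mean()) / (w.std() + 1e-9)

    # log1p for heavy-tailed counts
    out["phone_unlocks_log1p"] = np.log1p(out["phone_unlocks"])
    return out

demo = pd.DataFrame({
    "device_hours_per_day": [2.0, 8.0, 5.0, 12.0],
    "sleep_hours": [8.0, 6.0, 7.0, 4.0],
    "phone_unlocks": [20, 150, 80, 300],
})
feat = engineer(demo)
```

Winsorizing before the z-score keeps a handful of extreme days from dominating the scale; log1p tames the count features without dropping zeros.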
- ROC–AUC · Average Precision · Brier Score
- Comparative ROC/PR curves (all models)
- Reliability curve (calibrated best) · Lift curve
- AUC-based permutation importance for interpretability
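The AUC-based permutation importance can be reproduced with scikit-learn's `permutation_importance` and `scoring="roc_auc"`; again a synthetic matrix stands in for the real features:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

# Two informative features (0, 1) and two noise features (2, 3)
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Shuffle each column and measure the drop in ROC-AUC
result = permutation_importance(
    model, X, y, scoring="roc_auc", n_repeats=10, random_state=0
)
ranking = np.argsort(result.importances_mean)[::-1]
```

Because the score is model-agnostic and computed on held-out predictions, it is "Kaggle-safe": it needs no GPU and no model-specific attributes.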
Threshold grid t ∈ [0,1] using:
total_cost = FP*C_fp + FN*C_fn - TP*benefit
- Report min-cost and best-F1 thresholds
- Optional capacity constraint → choose lowest-cost threshold under review limit
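The threshold search above can be sketched directly from the cost formula; the cost/benefit values here are illustrative defaults, not the notebook's:

```python
import numpy as np

def min_cost_threshold(y_true, scores, c_fp=1.0, c_fn=5.0, benefit_tp=2.0,
                       max_review_rate=None):
    """Scan t in [0, 1]; total_cost = FP*C_fp + FN*C_fn - TP*benefit.
    Optionally restrict to thresholds flagging <= max_review_rate of cases."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    best_t, best_cost = None, np.inf
    for t in np.linspace(0.0, 1.0, 101):
        pred = (scores >= t).astype(int)
        if max_review_rate is not None and pred.mean() > max_review_rate:
            continue  # over review capacity, skip this threshold
        fp = int(np.sum((pred == 1) & (y_true == 0)))
        fn = int(np.sum((pred == 0) & (y_true == 1)))
        tp = int(np.sum((pred == 1) & (y_true == 1)))
        cost = fp * c_fp + fn * c_fn - tp * benefit_tp
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

best_t, best_cost = min_cost_threshold([0, 0, 1, 1], [0.1, 0.4, 0.6, 0.9])
```

With a capacity constraint, the search simply ignores thresholds whose flag rate exceeds the review limit, then returns the cheapest of the remainder.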
All reproducible artifacts stored under artifacts/:
eda/ → schema_overview.csv, engineered_corr.csv
metrics/ → leaderboard_metrics.csv, permutation_importance_auc.csv
decision/ → decision_cost_curve.csv, thresholds_summary.csv
models/ → best_calibrated.joblib, feature_schema.json, model_info.json
inference/ → smoke_test_report.json
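The export plus smoke test could look roughly like this, assuming the `artifacts/models/` layout above (the toy model and two-feature schema are placeholders):

```python
import json
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

ART = Path("artifacts/models")
ART.mkdir(parents=True, exist_ok=True)

features = ["screen_per_sleep_norm", "unlocks_per_hour"]  # assumed schema
X = np.random.default_rng(0).normal(size=(100, len(features)))
y = (X[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X, y)

# Persist model + feature schema side by side
joblib.dump(model, ART / "best_calibrated.joblib")
(ART / "feature_schema.json").write_text(json.dumps({"features": features}))

# Inference smoke test: reload from disk and score one row
reloaded = joblib.load(ART / "best_calibrated.joblib")
score = float(reloaded.predict_proba(X[:1])[0, 1])
assert 0.0 <= score <= 1.0
```

Shipping the schema next to the model lets the inference side validate column names and order before scoring.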
pip install -r requirements.txt
Plotly renderer:
import plotly.io as pio
pio.renderers.default = "notebook_connected"  # auto PNG fallback (kaleido)
- Target: high_risk_flag (0/1)
  - If categorical → TARGET_MAP = {"Low": 0, "High": 1}
- Default path: /kaggle/input/digital-health-and-mental-wellness/Data.csv
.
├── predicting-wellbeing-risk.ipynb
├── data/
│ └── raw/ # put Data.csv here for local runs
├── artifacts/ # exported tables / metrics / model files
├── repo_utils/
│ └── pathing.py # local + Kaggle path helpers
├── CASE_STUDY.md
├── requirements.txt
└── .gitignore
Local (recommended):
- Put Data.csv under data/raw/
Kaggle:
- Falls back to /kaggle/input/digital-health-and-mental-wellness/Data.csv
Optional override:
- Set DATA_PATH to a full file path (local runs).
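A minimal sketch of this resolution order (env override → local copy → Kaggle input); the helper name `resolve_data_path` is illustrative, not necessarily what `repo_utils/pathing.py` exports:

```python
import os
from pathlib import Path

def resolve_data_path() -> Path:
    """DATA_PATH env var > local data/raw/Data.csv > Kaggle input path."""
    override = os.environ.get("DATA_PATH")
    if override:
        return Path(override)
    local = Path("data/raw/Data.csv")
    if local.exists():
        return local
    return Path("/kaggle/input/digital-health-and-mental-wellness/Data.csv")
```

Checking the override first keeps local experiments one environment variable away from the default layout.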
See CASE_STUDY.md for the project story, decisions, and takeaways.