This project aims to analyze and predict the 10-year risk of coronary heart disease (CHD) using patient data from the Framingham Heart Study. The dataset contains over 4,240 records with 15 attributes, allowing for exploration, feature engineering, and the development of classification models.
- Source: Framingham Heart Study
- Records: Over 4,240
- Attributes: 15 features including demographics, health metrics, and lifestyle information relevant to cardiovascular health.
The objective is to classify whether a patient is at risk of developing CHD within the next decade.
- Analysis: The distribution of the `age` feature was examined with histograms and boxplots to identify potential outliers.
- Visualizations: [Include generated plots here]
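The boxplot check described above corresponds to the 1.5 × IQR whisker rule. The following is a minimal sketch, assuming the data sits in a pandas DataFrame with an `age` column; the sample values and the helper name `iqr_outliers` are illustrative and not taken from the project.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying outside the boxplot whiskers (k * IQR rule)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Made-up ages for illustration; in the project this would be df["age"].
ages = pd.Series([39, 46, 48, 61, 46, 43, 63, 45, 52, 43, 110], name="age")
outliers = iqr_outliers(ages)
print(outliers.tolist())
```

Values flagged this way are candidates for inspection rather than automatic removal, since an unusual age can still be a valid record.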
- Top Correlated Features:
  - sysBP: Correlation value: [insert value]
  - totChol: Correlation value: [insert value]
  - BMI: Correlation value: [insert value]
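- Interpretation: A strong correlation with the target indicates that a feature is linearly associated with CHD risk and is likely to be a useful predictor. Correlation alone does not establish a causal effect.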
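One way to obtain the correlation values listed above is to rank each feature by its Pearson correlation with the target. This is a sketch on synthetic data, not the project's results: the target column name `TenYearCHD` is an assumption, and the generated numbers only demonstrate the mechanics.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data; feature names follow the README,
# and the target name "TenYearCHD" is an assumption.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "sysBP": rng.normal(132, 22, n),
    "totChol": rng.normal(237, 44, n),
    "BMI": rng.normal(26, 4, n),
})
# Make the synthetic target depend on sysBP so one correlation is clearly non-zero.
risk = (df["sysBP"] - 132) / 22 + rng.normal(0, 1, n)
df["TenYearCHD"] = (risk > 1).astype(int)

# Pearson correlation of each feature with the target, ranked by absolute value.
corr = df.corr()["TenYearCHD"].drop("TenYearCHD")
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked.round(3))
```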
- Imbalanced Features Identified:
  - currentSmoker
  - diabetes
- Impact of Imbalance: Class imbalance can lead to biased model predictions and decreased sensitivity towards minority classes.
- Suggested Technique: Use SMOTE (Synthetic Minority Over-sampling Technique) to balance the classes, applied to the training split only so that synthetic samples do not leak into evaluation.
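SMOTE creates new minority-class rows by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified NumPy illustration of that idea on synthetic data; the function name `smote_like` is hypothetical, and in practice the `SMOTE` class from the `imbalanced-learn` package would typically be used instead.

```python
import numpy as np


def smote_like(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)              # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour per row
    gap = rng.random((n_new, 1))                                 # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Synthetic 90/10 class split for illustration only.
rng = np.random.default_rng(42)
X_major = rng.normal(0.0, 1.0, size=(90, 2))
X_minor = rng.normal(3.0, 1.0, size=(10, 2))

X_new = smote_like(X_minor, n_new=len(X_major) - len(X_minor))
X_bal = np.vstack([X_major, X_minor, X_new])
y_bal = np.array([0] * len(X_major) + [1] * (len(X_minor) + len(X_new)))
print(np.bincount(y_bal))
```

Because every synthetic row lies on a segment between two real minority rows, the new points stay inside the region the minority class already occupies.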
- Analyzed Relationship: [Specify which two numerical features were analyzed]
- Interpretation of Plots: [Insert your interpretation here]
- Encoding Applied:
  - Nominal Features: One-Hot Encoding for `sex`, `currentSmoker`, `BPMeds`, `prevalentStroke`, `prevalentHyp`, and `diabetes`.
- Scaling Type: StandardScaler applied to numerical features like `age`, `cigsPerDay`, `totChol`, `sysBP`, `diaBP`, `BMI`, `heartRate`, and `glucose`.
- Justification: Scaling is crucial for models sensitive to feature scales, particularly those using distance metrics.
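The encoding and scaling steps above can be combined in a single scikit-learn `ColumnTransformer`. This is a sketch under assumptions rather than the project's actual pipeline: the column names are the ones listed in this README, the four data rows are made up, and `drop="if_binary"` is one possible choice for two-valued features.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

nominal = ["sex", "currentSmoker", "BPMeds", "prevalentStroke", "prevalentHyp", "diabetes"]
numeric = ["age", "cigsPerDay", "totChol", "sysBP", "diaBP", "BMI", "heartRate", "glucose"]

preprocess = ColumnTransformer(
    transformers=[
        # drop="if_binary" keeps a single 0/1 column per two-valued feature.
        ("onehot", OneHotEncoder(drop="if_binary"), nominal),
        ("scale", StandardScaler(), numeric),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# Four made-up rows for illustration only.
df = pd.DataFrame({
    "sex": ["M", "F", "F", "M"],
    "currentSmoker": [1, 0, 0, 1],
    "BPMeds": [0, 0, 1, 0],
    "prevalentStroke": [0, 0, 0, 1],
    "prevalentHyp": [0, 1, 1, 0],
    "diabetes": [0, 0, 1, 0],
    "age": [39, 46, 61, 52],
    "cigsPerDay": [20, 0, 0, 30],
    "totChol": [195, 250, 285, 228],
    "sysBP": [106.0, 121.0, 150.0, 141.5],
    "diaBP": [70.0, 81.0, 95.0, 89.0],
    "BMI": [26.97, 28.73, 28.58, 30.30],
    "heartRate": [80, 95, 65, 77],
    "glucose": [77, 76, 103, 99],
})

X = preprocess.fit_transform(df)
print(X.shape)
```

When a train/test split is used, the transformer should be fitted on the training split and then applied to the test split with `transform`, so that the scaling statistics come from the training data only.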
- Feature Created: Pulse Pressure = `sysBP - diaBP`
- Relevance: Provides insights into arterial health and might correlate with CHD risk.
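In pandas the derived feature is a single column subtraction. A minimal sketch with made-up readings; the new column name `pulsePressure` is a hypothetical choice.

```python
import pandas as pd

# Made-up blood pressure readings (mmHg) for illustration only.
df = pd.DataFrame({"sysBP": [106.0, 121.0, 150.0], "diaBP": [70.0, 81.0, 95.0]})

# Pulse pressure is the gap between systolic and diastolic pressure.
df["pulsePressure"] = df["sysBP"] - df["diaBP"]
print(df["pulsePressure"].tolist())
```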
- Technique Used: PCA (Principal Component Analysis)
- Benefits: Reduces the feature space while preserving most of the variance, which can reduce overfitting and may improve model performance.
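A sketch of the PCA step with scikit-learn, run on synthetic data. The 95% variance threshold is an assumed setting, since this README does not state how many components were kept.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: 8 columns driven by 3 latent factors plus a little noise.
latent = rng.normal(size=(300, 3))
X = latent @ rng.normal(size=(3, 8)) + 0.1 * rng.normal(size=(300, 8))

# PCA is variance-based, so the features are standardized first.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the smallest number of components that
# explains at least that fraction of the variance (0.95 is an assumption).
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```

The trade-off is interpretability: each principal component is a linear mix of the original clinical measurements, so its coefficients no longer map to a single feature.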
- Models Used:
  - Logistic Regression
  - Decision Tree Classifier
- Hyperparameters: [Specify the hyperparameters used for each model]
- Accuracy Results:
  - Logistic Regression: Train Accuracy: [insert], Test Accuracy: [insert]
  - Decision Tree: Train Accuracy: [insert], Test Accuracy: [insert]
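The train and test accuracies for both models can be collected with a loop like the one below. This is a sketch on a synthetic, imbalanced dataset generated with `make_classification`; the hyperparameter values and the 80/20 split are placeholders, not the project's settings, and the printed numbers are not the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the preprocessed features.
X, y = make_classification(n_samples=1000, n_features=14, n_informative=6,
                           weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Hyperparameter values are placeholders, not the project's settings.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = (
        accuracy_score(y_train, model.predict(X_train)),  # train accuracy
        accuracy_score(y_test, model.predict(X_test)),    # test accuracy
    )
    print(f"{name}: train={results[name][0]:.3f}, test={results[name][1]:.3f}")
```

A large gap between train and test accuracy points to overfitting, which an unconstrained decision tree is prone to. On imbalanced data, accuracy can also look high for a model that rarely predicts the minority class, so recall or a confusion matrix is worth checking alongside it.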
- Best Performing Model: [Identify which model performed better on the test set along with reasons]
- Observations: Discuss any inconsistencies and whether models are overfitting or underfitting.
- Potential Solutions:
  - Logistic Regression: [Suggest a solution]
  - Decision Tree: [Suggest a solution]
- Recommended for Deployment: [Choose between models]
- Reasons: Consider interpretability, performance, and computational efficiency in your recommendation.
This project highlights the process from data exploration to model implementation for predicting CHD risk. Insights gained from this analysis may assist in early diagnosis and treatment strategies in clinical settings.
- Clone the repository.
- Install required packages by running `pip install -r requirements.txt`.
- Run the Jupyter notebooks to reproduce the analysis and results.
- The dataset from the Framingham Heart Study.
- References to literature and methodologies used in the project.