This project uses unsupervised machine learning to segment wholesale customers by their annual spending across product categories. By comparing K-Means and Gaussian Mixture Models (GMM), we identify distinct business profiles and evaluate how robust these segments are in a reduced 3D feature space.
- Log Transformation: Histograms showed heavily right-skewed spending in every category, so a log transformation is needed before distance-based clustering.
- Outlier Identification: Boxplots highlighted significant spending outliers in the 'Fresh' category, particularly within the Horeca channel.
- Feature Correlation: A correlation matrix revealed a strong 0.92 correlation between Grocery and Detergents_Paper, justifying dimensionality reduction via PCA.
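
The checks above can be sketched as follows. This is a minimal illustration on synthetic lognormal data, not the project's dataset: the generated values, the sample size, and the choice of `np.log1p` (rather than a plain log) are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for three spending columns (assumption: NOT the real data).
rng = np.random.default_rng(0)
shared = rng.normal(size=500)  # common factor that makes two columns correlate
df = pd.DataFrame({
    "Grocery": np.exp(8 + shared + 0.3 * rng.normal(size=500)),
    "Detergents_Paper": np.exp(7 + shared + 0.3 * rng.normal(size=500)),
    "Fresh": np.exp(9 + rng.normal(size=500)),
})

# log1p = log(1 + x); it stays finite even if a spending value is 0.
log_df = np.log1p(df)

# Skewness per column before and after the transform.
skew_before = df.skew()
skew_after = log_df.skew()

# Pearson correlation matrix of the transformed features.
corr = log_df.corr()

print(pd.DataFrame({"skew_before": skew_before, "skew_after": skew_after}))
print(corr.round(2))
```

A skewness close to zero after the transform indicates the right tail has been compressed; the correlation matrix then shows which feature pairs are largely redundant.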
We reduced the dataset to 3 Principal Components, capturing 73.8% of total variance.
- PC1 (Retail/Bulk): Heavily weighted by Grocery and Detergents_Paper.
- PC2 (Fresh/Frozen): Driven by Fresh and Frozen food categories.
- PC3 (Geography): Almost exclusively captures the 'Region' feature (0.98 weighting).
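
A minimal sketch of the reduction step, assuming the features are standardized with `StandardScaler` before `PCA`. The random 8-column matrix is a placeholder, so the printed variance ratio and loadings will not match the 73.8% or the weightings reported above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder feature matrix (assumption: 440 rows, 8 columns, random values).
rng = np.random.default_rng(0)
X = rng.normal(size=(440, 8))
X[:, 1] = X[:, 0] + 0.3 * rng.normal(size=440)  # one strongly correlated pair

# Standardize so that no single feature dominates the components.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the first three principal components.
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# Share of total variance retained by the three components.
explained = pca.explained_variance_ratio_.sum()

# Each row of components_ holds one component's weights over the input features.
loadings = pca.components_

print(f"variance captured: {explained:.1%}")
print(loadings.round(2))
```

Reading a row of `loadings` is how a component gets its interpretation: the features with the largest absolute weights are the ones that component mostly represents.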
We evaluated the optimal number of clusters using the Elbow Method for K-Means and BIC scores for GMM.
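
One way to run this comparison is sketched below. The blob data, the candidate range of 1 to 6 clusters, and the `full` covariance type are assumptions for illustration; the elbow is normally read off a plot of the inertia values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic 3D data with three well-separated groups (assumption: NOT the real data).
X, _ = make_blobs(
    n_samples=440,
    centers=[[0, 0, 0], [8, 8, 0], [-8, 8, 8]],
    cluster_std=1.0,
    random_state=0,
)

ks = list(range(1, 7))
inertias = []  # K-Means within-cluster sum of squares, used for the elbow plot
bics = []      # GMM Bayesian Information Criterion, lower is better

for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit(X)
    bics.append(gmm.bic(X))

# BIC gives a direct pick; the elbow is judged visually from `inertias`.
best_k_bic = ks[int(np.argmin(bics))]

for k, inertia, bic in zip(ks, inertias, bics):
    print(f"k={k}  inertia={inertia:10.1f}  BIC={bic:10.1f}")
print("k with lowest BIC:", best_k_bic)
```

Inertia always tends to fall as k grows, which is why the elbow needs a judgment call, whereas BIC penalizes extra parameters and so has a well-defined minimum.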
- Disagreement: The models disagreed on 218 out of 440 points (~50%), highlighting the non-spherical, overlapping nature of the customer data.
- K-Means (Hard): Assigns each point to its nearest centroid, producing strict, distance-based boundaries that implicitly assume roughly spherical clusters.
- GMM (Soft): Identified elongated, density-based clusters that better account for the correlation between features like Milk and Grocery.
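
Cluster IDs from two models are arbitrary, so a disagreement count is only meaningful after the labels are matched. The sketch below shows one possible approach, assuming the labels are aligned with the Hungarian algorithm on the confusion matrix; how the count above was actually produced is not stated here, and the elongated synthetic clusters are illustrative only.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix
from sklearn.mixture import GaussianMixture

# Synthetic elongated, overlapping clusters (assumption: NOT the real data).
rng = np.random.default_rng(0)
cov = [[4.0, 3.5, 0.0], [3.5, 4.0, 0.0], [0.0, 0.0, 1.0]]
X = np.vstack([
    rng.multivariate_normal([0, 0, 0], cov, size=150),
    rng.multivariate_normal([3, -3, 0], cov, size=150),
    rng.multivariate_normal([0, 0, 5], cov, size=140),
])

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)

# Rows are K-Means labels, columns are GMM labels.
cm = confusion_matrix(km_labels, gmm_labels)

# Find the one-to-one label matching that maximizes the number of agreeing points.
rows, cols = linear_sum_assignment(-cm)
perm = np.empty(cm.shape[0], dtype=int)
perm[rows] = cols

# Relabel the K-Means output in GMM terms, then compare point by point.
km_aligned = perm[km_labels]
disputed = km_aligned != gmm_labels
n_disputed = int(disputed.sum())

print(f"{n_disputed} of {len(X)} points assigned differently")
```

The boolean mask `disputed` can be used to pull out the contested rows for closer inspection.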
Based on spending averages, we identified three primary segments:
- Large Retailers: High-volume buyers of non-perishables (Detergents/Paper).
- Small-Scale Businesses: Low-volume buyers across all categories.
- Hospitality/Horeca: High-volume buyers of Fresh and Delicassen items.
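
Segment names like these are typically read off a table of per-cluster spending averages. The sketch below shows that step with a tiny hand-made table; the numbers and the `cluster` column name are hypothetical and chosen only to make the pattern visible.

```python
import pandas as pd

# Toy spending table with cluster labels attached (assumption: made-up values).
df = pd.DataFrame({
    "Fresh":            [12000, 14000, 1000, 1200, 3000, 5000],
    "Detergents_Paper": [300,   500,   200,  400,  6000, 8000],
    "Delicassen":       [2000,  3000,  200,  400,  900,  1100],
    "cluster":          [0,     0,     1,    1,    2,    2],
})

# Mean spending per category within each cluster.
profile = df.groupby("cluster").mean()

# For each category, the cluster with the highest average spend.
top_cluster = profile.idxmax()

print(profile)
print(top_cluster)
```

In this toy table, cluster 0 leads on Fresh and Delicassen, cluster 2 leads on Detergents_Paper, and cluster 1 is low everywhere, which mirrors how the three profiles above are distinguished.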
- Logistics: Prioritize frequent fresh deliveries for the Hospitality segment.
- Targeting: Focus bulk-purchase promotions on the Retailer segment.
- Strategic Growth: The 218 "disputed" points may represent a transitional market, where businesses are scaling from small-scale operations toward specialized hospitality or retail.
- Python, Pandas, NumPy
- Scikit-Learn (StandardScaler, PCA, KMeans, GaussianMixture)
- Matplotlib, Seaborn