Wholesale Customer Segmentation: K-Means vs. GMM

Project Overview

This project uses unsupervised machine learning to segment wholesale customers by their annual spending across product categories. By comparing K-Means and Gaussian Mixture Models (GMM), we identify distinct business profiles and evaluate how robust those segments are in a reduced 3D feature space.

1. Exploratory Data Analysis (EDA)

Log Transformation: Histograms showed heavily right-skewed spending in every category, so we log-transformed the spending features before clustering. Without this step, a few very large values dominate distance- and variance-based methods.
Outlier Identification: Boxplots highlighted significant spending outliers in the 'Fresh' category, particularly within the Horeca channel.
Feature Correlation: The correlation matrix showed a strong correlation (0.92) between Grocery and Detergents_Paper, which motivated dimensionality reduction via PCA.
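
The log transformation and correlation check described above can be sketched as follows. This is a minimal illustration on synthetic log-normal data, not the project's dataset; the use of `np.log1p` (rather than a plain log) and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for right-skewed spending columns (hypothetical data).
rng = np.random.default_rng(0)
base = rng.lognormal(mean=8.0, sigma=1.0, size=500)
spend = pd.DataFrame({
    "Grocery": base * rng.lognormal(0.0, 0.2, size=500),
    "Detergents_Paper": 0.4 * base * rng.lognormal(0.0, 0.3, size=500),
    "Fresh": rng.lognormal(mean=9.0, sigma=1.2, size=500),
})

# log1p compresses the long right tail and stays defined for zero spend.
log_spend = np.log1p(spend)

# Skewness per column before and after the transform.
skew_before = spend.skew()
skew_after = log_spend.skew()

# Pairwise Pearson correlations on the transformed features.
corr = log_spend.corr()

print(pd.DataFrame({"skew_before": skew_before, "skew_after": skew_after}).round(2))
print(corr.round(2))
```

On real data, the same two checks (skewness per column and the correlation matrix) are a quick way to confirm that the transform did its job before scaling and PCA.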

2. Dimensionality Reduction (PCA)

We reduced the dataset to 3 Principal Components, capturing 73.8% of total variance.

PC1 (Retail/Bulk): Heavily weighted by Grocery and Detergents_Paper.
PC2 (Fresh/Frozen): Driven by Fresh and Frozen food categories.
PC3 (Geography): Loads almost exclusively on the 'Region' feature (loading of 0.98).
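
A sketch of the scaling and PCA step, showing where the explained-variance figure and the per-feature loadings come from. The feature list and the random data are assumptions for illustration; the numbers it prints will not match the 73.8% reported above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumed feature set; random values stand in for the log-transformed data.
features = ["Fresh", "Milk", "Grocery", "Frozen",
            "Detergents_Paper", "Delicassen", "Region"]
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(440, len(features))), columns=features)
# Inject a Grocery / Detergents_Paper correlation so PC1 has something to find.
X["Detergents_Paper"] = 0.9 * X["Grocery"] + 0.4 * X["Detergents_Paper"]

# Standardize so every feature contributes on the same scale.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=3, random_state=0)
scores = pca.fit_transform(X_scaled)  # shape (n_samples, 3)

# Share of total variance captured by each component.
explained = pca.explained_variance_ratio_

# Loadings: one row per feature, one column per component.
loadings = pd.DataFrame(pca.components_.T, index=features,
                        columns=["PC1", "PC2", "PC3"])

print(f"Variance captured by 3 components: {explained.sum():.1%}")
print(loadings.round(2))
```

Reading the `loadings` table column by column is how labels such as "Retail/Bulk" or "Geography" are assigned to components.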

3. Clustering & Model Comparison

We evaluated the optimal number of clusters using the Elbow Method for K-Means and BIC scores for GMM.
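
The two selection criteria can be computed in one loop, as in the sketch below. It runs on synthetic blobs with hand-picked centers; the range of `k`, `n_init`, and `covariance_type="full"` are assumptions, not settings confirmed by this project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic 3D data with three well-separated groups (illustration only).
X, _ = make_blobs(n_samples=440,
                  centers=[[0, 0, 0], [8, 8, 0], [-8, 8, 8]],
                  cluster_std=1.0, random_state=0)

ks = list(range(1, 7))
inertias = []  # K-Means within-cluster sum of squares, used for the elbow plot
bics = []      # GMM Bayesian Information Criterion, lower is better

for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

    gmm = GaussianMixture(n_components=k, covariance_type="full",
                          random_state=0).fit(X)
    bics.append(gmm.bic(X))

# BIC has a clear minimum; the elbow is usually judged by eye from a plot.
best_k_bic = ks[int(np.argmin(bics))]

for k, inertia, bic in zip(ks, inertias, bics):
    print(f"k={k}  inertia={inertia:10.1f}  BIC={bic:10.1f}")
print("k with lowest BIC:", best_k_bic)
```

Inertia always tends to fall as `k` grows, so the elbow is the point where further reductions flatten out, whereas BIC penalizes extra parameters and can be minimized directly.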

K-Means vs. GMM Results

Disagreement: The two models assigned 218 of the 440 customers (~50%) to different clusters, which suggests that the customer groups are non-spherical and overlapping.
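K-Means (Hard): Assigns each customer to the nearest centroid, which yields strict, distance-based boundaries and implicitly favors compact, roughly spherical clusters.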
GMM (Soft): Identified elongated, density-based clusters that better account for the correlation between features like Milk and Grocery.
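
Counting disagreements between two clusterings needs some care, because cluster IDs are arbitrary: cluster 0 in one model may correspond to cluster 2 in the other. The sketch below matches IDs first (Hungarian assignment on the overlap matrix) and then counts mismatches. Whether this project aligned labels this way is not stated, so treat `count_disagreements` as a hypothetical helper; the data is synthetic.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix
from sklearn.mixture import GaussianMixture


def count_disagreements(labels_a, labels_b):
    """Count points placed in different clusters after optimally matching IDs."""
    overlap = confusion_matrix(labels_a, labels_b)
    # Pick the one-to-one ID mapping with the largest total overlap.
    rows, cols = linear_sum_assignment(-overlap)
    return len(labels_a) - int(overlap[rows, cols].sum())


# Toy check: same partition with permuted IDs, plus one point that truly differs.
toy_a = [0, 0, 1, 1, 2, 2]
toy_b = [1, 1, 0, 0, 2, 0]
toy_disagreements = count_disagreements(toy_a, toy_b)

# Synthetic, correlated 3D data standing in for the PCA scores.
rng = np.random.default_rng(0)
cov = [[3.0, 2.5, 0.0], [2.5, 3.0, 0.0], [0.0, 0.0, 1.0]]
X = np.vstack([
    rng.multivariate_normal([0, 0, 0], cov, size=150),
    rng.multivariate_normal([4, 0, 0], cov, size=150),
    rng.multivariate_normal([0, 5, 3], cov, size=140),
])

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
gmm_labels = GaussianMixture(n_components=3, covariance_type="full",
                             random_state=0).fit_predict(X)

n_disputed = count_disagreements(km_labels, gmm_labels)
print(f"{n_disputed} of {len(X)} points are assigned differently")
```

Without the matching step, a raw `km_labels != gmm_labels` comparison can report a high disagreement rate even when the two partitions are nearly identical.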

4. Final Customer Profiles

Based on spending averages, we identified three primary segments:

Large Retailers: High-volume buyers of non-perishables (Detergents/Paper).
Small-Scale Businesses: Low-volume buyers across all categories.
Hospitality/Horeca: High-volume buyers of Fresh and Delicassen items.
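
Profiles like these are typically read off a table of per-cluster spending averages. The sketch below shows one way to build that table with pandas; the spending values and cluster labels are made up for illustration.

```python
import pandas as pd

# Hypothetical spending in original (untransformed) units.
spend = pd.DataFrame({
    "Fresh":            [12000,  900, 15000,  1100, 800,   700],
    "Grocery":          [ 2000, 1000,  2500, 18000, 900, 20000],
    "Detergents_Paper": [  300,  200,   400,  8000, 150,  9000],
})
labels = [2, 1, 2, 0, 1, 0]  # hypothetical cluster assignment per customer

# Mean spend per category within each cluster.
profiles = spend.assign(cluster=labels).groupby("cluster").mean()

# Number of customers in each cluster.
sizes = pd.Series(labels).value_counts().sort_index()

print(profiles.round(0))
print(sizes)
```

Averaging the original spending values, rather than the log-transformed or PCA-projected ones, keeps the profile table in units a business reader can interpret.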

5. Business Application

Logistics: Prioritize frequent fresh deliveries for the Hospitality segment.
Targeting: Focus bulk-purchase promotions on the Retailer segment.
Strategic Growth: The 218 "disputed" points may represent a transitional market in which businesses are scaling from small-scale operations toward specialized hospitality or retail.

Technologies Used

Python, Pandas, NumPy
Scikit-Learn (StandardScaler, PCA, KMeans, GaussianMixture)
Matplotlib, Seaborn
