Adem-grp/clustering-with-pca-kmeans-gmm

Wholesale Customer Segmentation: K-Means vs. GMM

Project Overview

This project uses unsupervised machine learning to segment wholesale customers by their annual spending across product categories. By comparing K-Means and Gaussian Mixture Models (GMM), we identify distinct business profiles and evaluate how robust these segments are in a reduced 3D feature space.

1. Exploratory Data Analysis (EDA)

  • Log Transformation: Histograms showed heavily right-skewed spending in every category, so a log transformation was applied to keep a few very large values from dominating the variance and distance calculations used by PCA and K-Means.
  • Outlier Identification: Boxplots highlighted significant spending outliers in the 'Fresh' category, particularly within the Horeca channel.
  • Feature Correlation: A correlation matrix revealed a strong correlation (0.92) between Grocery and Detergents_Paper, which motivated dimensionality reduction via PCA.
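
The effect of the log transformation and the correlation check can be sketched as follows. This is a minimal illustration on synthetic data, not the project's notebook code: the generated values, the choice of `np.log1p`, and the variable names are all assumptions. Only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for three spending columns (NOT the real dataset).
rng = np.random.default_rng(0)
grocery = rng.lognormal(mean=8.5, sigma=1.0, size=440)
spend = pd.DataFrame({
    "Fresh": rng.lognormal(mean=9.0, sigma=1.2, size=440),
    "Grocery": grocery,
    # Constructed to track Grocery, mimicking the correlation described above.
    "Detergents_Paper": grocery * rng.lognormal(mean=-1.0, sigma=0.3, size=440),
})

# log1p computes log(1 + x): it compresses the long right tail and is safe for zeros.
log_spend = np.log1p(spend)

skew_before = spend.skew()      # large positive values indicate right skew
skew_after = log_spend.skew()   # should be much closer to zero
corr = log_spend.corr()         # pairwise Pearson correlations

print(skew_before.round(2))
print(skew_after.round(2))
print(corr.round(2))
```

Comparing `skew_before` with `skew_after` gives a numeric counterpart to the histograms, and `corr` is the table a Seaborn heatmap would visualize.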

2. Dimensionality Reduction (PCA)

We reduced the dataset to 3 Principal Components, capturing 73.8% of total variance.

  • PC1 (Retail/Bulk): Heavily weighted by Grocery and Detergents_Paper.
  • PC2 (Fresh/Frozen): Driven by Fresh and Frozen food categories.
  • PC3 (Geography): Almost exclusively captures the 'Region' feature (loading of 0.98).
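
A minimal sketch of how the explained variance and the per-component loadings can be read off a fitted PCA. The data here is synthetic, and the exact set of input columns is an assumption (the README does not list which features were passed to PCA), so the printed numbers will not match the 73.8% reported above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Column names follow the Wholesale customers dataset; the values are synthetic.
cols = ["Channel", "Region", "Fresh", "Milk", "Grocery",
        "Frozen", "Detergents_Paper", "Delicassen"]
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(440, len(cols))), columns=cols)
# Make one pair strongly correlated so that a shared component emerges.
X["Detergents_Paper"] = 0.9 * X["Grocery"] + 0.3 * rng.normal(size=440)

# Standardize first so that no single feature dominates the variance.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=3)
scores = pca.fit_transform(X_scaled)   # shape (n_samples, 3)

explained = pca.explained_variance_ratio_.sum()
# Rows are features, columns are components; large absolute values
# show which features drive each component.
loadings = pd.DataFrame(pca.components_.T, index=cols,
                        columns=["PC1", "PC2", "PC3"])

print(f"Variance captured by 3 components: {explained:.1%}")
print(loadings.round(2))
```

Interpretations such as "PC1 is heavily weighted by Grocery and Detergents_Paper" come from scanning each column of `loadings` for its largest absolute entries.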

3. Clustering & Model Comparison

We selected the number of clusters using the Elbow Method (inertia) for K-Means and BIC scores for GMM.
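
Both selection criteria can be computed in a single loop over candidate cluster counts. The sketch below uses three synthetic blobs as a stand-in for the 3D PCA scores; the range of `k`, the `random_state` values, and `covariance_type="full"` are assumptions rather than the project's settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Three synthetic blobs in 3D, standing in for the PCA scores.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [6.0, 0.0, 0.0], [0.0, 6.0, 0.0]])
X = np.vstack([c + rng.normal(size=(150, 3)) for c in centers])

ks = list(range(1, 7))
inertia = []  # within-cluster sum of squares (Elbow Method)
bic = []      # Bayesian Information Criterion (lower is better)

for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertia.append(km.inertia_)

    gmm = GaussianMixture(n_components=k, covariance_type="full",
                          random_state=0).fit(X)
    bic.append(gmm.bic(X))

best_k_bic = ks[int(np.argmin(bic))]

for k, i, b in zip(ks, inertia, bic):
    print(f"k={k}  inertia={i:10.1f}  BIC={b:10.1f}")
print("k with lowest BIC:", best_k_bic)
```

The elbow is read by eye as the `k` after which inertia stops dropping sharply, whereas BIC gives a single minimum, so the two criteria need not agree.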

K-Means vs. GMM Results

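  • Disagreement: The models assigned 218 of 440 points (~50%) to different clusters, which suggests the customer groups are non-spherical and overlapping.
  • K-Means (Hard): Assigns each point to its nearest centroid, which produces straight-line boundaries between clusters and implicitly assumes compact, roughly spherical groups.
  • GMM (Soft): Gives each point a membership probability for every cluster and can fit elongated, elliptical clusters that better account for the correlation between features like Milk and Grocery.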
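
One caveat when counting disagreements: cluster IDs are arbitrary, so K-Means cluster 0 need not correspond to GMM cluster 0. The README does not state whether the labels were aligned before counting, so the helper below (`count_disagreements`, a hypothetical name) is only a sketch of one way to do it, by trying every relabelling and keeping the best match. The label arrays are toy values, not the project's results.

```python
from itertools import permutations

import numpy as np


def count_disagreements(labels_a, labels_b, n_clusters):
    """Count points labelled differently after the best relabelling of labels_b."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    best_agree = 0
    # Brute force is fine for small n_clusters (3 clusters -> 6 permutations).
    for perm in permutations(range(n_clusters)):
        relabelled = np.array(perm)[labels_b]  # map each ID in labels_b via perm
        best_agree = max(best_agree, int(np.sum(relabelled == labels_a)))
    return len(labels_a) - best_agree


# Toy labels: same grouping except for the last point, but with different IDs.
kmeans_labels = np.array([0, 0, 1, 1, 2, 2])
gmm_labels = np.array([2, 2, 0, 0, 1, 0])

naive = int(np.sum(kmeans_labels != gmm_labels))
aligned = count_disagreements(kmeans_labels, gmm_labels, n_clusters=3)
print("naive mismatches:", naive)      # 6: every ID differs
print("aligned mismatches:", aligned)  # 1: only the last point truly differs
```

For larger cluster counts, `scipy.optimize.linear_sum_assignment` applied to the confusion matrix finds the same alignment without enumerating permutations.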

4. Final Customer Profiles

Based on spending averages, we identified three primary segments:

  1. Large Retailers: High-volume buyers of non-perishables (Detergents/Paper).
  2. Small-Scale Businesses: Low-volume buyers across all categories.
  3. Hospitality/Horeca: High-volume buyers of Fresh and Delicassen items.
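
Profiles like these are typically derived by averaging the original (untransformed) spending per cluster. The sketch below shows the idea with a handful of made-up rows; the values and the `cluster` column name are illustrative assumptions, not the project's data.

```python
import pandas as pd

# Made-up spending rows with an already-assigned cluster label.
spend = pd.DataFrame({
    "Fresh":            [12000, 15000, 3000, 2500, 4000, 5000],
    "Grocery":          [3000, 2500, 2000, 1500, 20000, 18000],
    "Detergents_Paper": [300, 400, 500, 450, 9000, 8000],
    "cluster":          [0, 0, 1, 1, 2, 2],
})

# Mean spend per category within each cluster.
profiles = spend.groupby("cluster").mean()

# Category with the highest average spend in each cluster.
top_category = profiles.idxmax(axis=1)

print(profiles)
print(top_category)
```

Averaging in original currency units rather than on log-scaled or PCA-transformed values keeps the profiles directly interpretable for business use.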

5. Business Application

  • Logistics: Prioritize frequent fresh deliveries for the Hospitality segment.
  • Targeting: Focus bulk-purchase promotions on the Retailer segment.
  • Strategic Growth: The 218 "disputed" points may represent a transitional market, where businesses are scaling from small-scale operations toward specialized hospitality or retail.

Technologies Used

  • Python, Pandas, NumPy
  • Scikit-Learn (StandardScaler, PCA, KMeans, GaussianMixture)
  • Matplotlib, Seaborn
