Python Program to Cluster Customers Using sklearn KMeans
This article shows how to cluster customers in Python using scikit-learn's KMeans: create a KMeans model, fit it on customer data with model.fit(data), and read the cluster assignments from model.labels_.
Code
```python
from sklearn.cluster import KMeans
import numpy as np

# Sample customer data: features such as age and spending
data = np.array([[5, 3], [10, 15], [24, 10], [30, 45],
                 [85, 70], [71, 80], [60, 78], [55, 52]])

# Create KMeans with 2 clusters
model = KMeans(n_clusters=2, random_state=42)
model.fit(data)

# Print the cluster label for each customer
print(model.labels_)
```
Dry Run
Let's trace how KMeans clusters the 8 customers, each described by two features.
Input Data
data = [[5, 3], [10, 15], [24, 10], [30, 45], [85, 70], [71, 80], [60, 78], [55, 52]]
Initialize KMeans
Set n_clusters=2 and random_state=42 for reproducibility.
Fit Model
Model finds 2 centers that best group the data points.
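After fitting, the learned centers can be inspected via `model.cluster_centers_`. A sketch on the same sample data (the exact center coordinates depend on the converged solution):

```python
from sklearn.cluster import KMeans
import numpy as np

data = np.array([[5, 3], [10, 15], [24, 10], [30, 45],
                 [85, 70], [71, 80], [60, 78], [55, 52]])

model = KMeans(n_clusters=2, random_state=42)
model.fit(data)

# Each row is one cluster center in feature space (e.g. age, spending)
print(model.cluster_centers_)
```

For this data the two centers land near the middle of the low-value group and the high-value group, which is what the label assignments below reflect.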
Assign Clusters
Each customer is assigned a cluster label: [0 0 0 0 1 1 1 1]
| Customer Index | Features | Cluster Label |
|---|---|---|
| 0 | [5, 3] | 0 |
| 1 | [10, 15] | 0 |
| 2 | [24, 10] | 0 |
| 3 | [30, 45] | 0 |
| 4 | [85, 70] | 1 |
| 5 | [71, 80] | 1 |
| 6 | [60, 78] | 1 |
| 7 | [55, 52] | 1 |
Why This Works
Step 1: Why KMeans?
KMeans groups data points by minimizing distance to cluster centers, making it good for customer segmentation.
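The "minimize distance to centers" idea can be made concrete with plain NumPy: given two centers, each point is assigned to the nearer one. The center values below are illustrative, not the fitted ones:

```python
import numpy as np

point = np.array([10, 15])
# Two illustrative cluster centers (hypothetical values)
centers = np.array([[17.0, 18.0], [68.0, 70.0]])

# Euclidean distance from the point to each center
distances = np.linalg.norm(centers - point, axis=1)
nearest = np.argmin(distances)
print(distances, "-> assigned to cluster", nearest)
```

The point is far closer to the first center, so it joins cluster 0; KMeans applies this rule to every point on every iteration.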
Step 2: Choosing Number of Clusters
You pick how many groups (clusters) you want; here, 2 groups separate customers into two main types.
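In practice the number of clusters is often chosen with the elbow method: fit KMeans for several values of k and look for the point where inertia (the within-cluster sum of squared distances) stops dropping sharply. A minimal sketch on the same sample data:

```python
from sklearn.cluster import KMeans
import numpy as np

data = np.array([[5, 3], [10, 15], [24, 10], [30, 45],
                 [85, 70], [71, 80], [60, 78], [55, 52]])

for k in range(1, 5):
    model = KMeans(n_clusters=k, random_state=42).fit(data)
    # inertia_ = sum of squared distances of points to their nearest center
    print(f"k={k}: inertia={model.inertia_:.1f}")
```

The big drop from k=1 to k=2, followed by smaller gains, suggests 2 is a reasonable choice here.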
Step 3: Model Fitting
The model finds centers that best represent each cluster by iteratively updating them.
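The iterative update can be sketched by hand in NumPy: assign each point to its nearest center, then move each center to the mean of its assigned points. This is a simplified single loop, not scikit-learn's full implementation (which also handles smart initialization and empty clusters):

```python
import numpy as np

data = np.array([[5, 3], [10, 15], [24, 10], [30, 45],
                 [85, 70], [71, 80], [60, 78], [55, 52]], dtype=float)

# Start from two arbitrary data points as initial centers
centers = data[[0, 4]].copy()

for _ in range(10):  # a few iterations suffice for this tiny dataset
    # Assignment step: nearest center for every point
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each center moves to the mean of its assigned points
    centers = np.array([data[labels == c].mean(axis=0) for c in range(2)])

print(labels)
print(centers)
```

On this data the assignments stabilize after the first pass, which is exactly when KMeans declares convergence.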
Step 4: Cluster Labels
Each customer gets a label showing which cluster they belong to, useful for targeted marketing.
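Once fitted, the same model can label new customers with `model.predict`, which assigns each new point to its nearest learned center. A sketch using a hypothetical new customer:

```python
from sklearn.cluster import KMeans
import numpy as np

data = np.array([[5, 3], [10, 15], [24, 10], [30, 45],
                 [85, 70], [71, 80], [60, 78], [55, 52]])

model = KMeans(n_clusters=2, random_state=42).fit(data)

# A hypothetical new customer with low feature values
new_customer = np.array([[12, 8]])
print(model.predict(new_customer))
```

The new point falls near the low-value group, so it receives the same label as the first four customers.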
Alternative Approaches
Agglomerative clustering builds clusters bottom-up by repeatedly merging the closest pairs:

```python
from sklearn.cluster import AgglomerativeClustering
import numpy as np

data = np.array([[5, 3], [10, 15], [24, 10], [30, 45],
                 [85, 70], [71, 80], [60, 78], [55, 52]])

model = AgglomerativeClustering(n_clusters=2)
labels = model.fit_predict(data)
print(labels)
```
DBSCAN groups points that are densely packed and marks outliers as noise (label -1), so the number of clusters does not need to be specified in advance:

```python
from sklearn.cluster import DBSCAN
import numpy as np

data = np.array([[5, 3], [10, 15], [24, 10], [30, 45],
                 [85, 70], [71, 80], [60, 78], [55, 52]])

model = DBSCAN(eps=20, min_samples=2)
labels = model.fit_predict(data)
print(labels)
```
Complexity: O(n * k * i * d) time, O(n * d) space
Time Complexity
KMeans runs in O(n * k * i * d), where n is the number of data points, k the number of clusters, i the number of iterations, and d the number of features; runtime grows with more clusters or more iterations.
Space Complexity
KMeans stores the data and the k cluster centers, so O(n * d) space suffices; no large auxiliary memory is required.
Which Approach is Fastest?
KMeans is generally faster than hierarchical clustering but less flexible than DBSCAN for complex shapes.
| Approach | Time | Space | Best For |
|---|---|---|---|
| KMeans | O(n * k * i * d) | O(n * d) | Large datasets with spherical clusters |
| Agglomerative Clustering | O(n^3) | O(n^2) | Small datasets, hierarchical relationships |
| DBSCAN | O(n log n) | O(n * d) | Clusters of arbitrary shape, noise detection |
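The relative speeds in the table can be checked empirically with `time.perf_counter` on a larger synthetic dataset. A rough benchmark sketch (absolute numbers depend on hardware and library versions):

```python
import time
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

rng = np.random.default_rng(0)
data = rng.random((2000, 2)) * 100  # synthetic customer features

for name, model in [("KMeans", KMeans(n_clusters=2, random_state=42)),
                    ("Agglomerative", AgglomerativeClustering(n_clusters=2)),
                    ("DBSCAN", DBSCAN(eps=5, min_samples=5))]:
    start = time.perf_counter()
    model.fit(data)
    print(f"{name}: {time.perf_counter() - start:.3f}s")
```

On datasets much larger than this, agglomerative clustering's cubic cost typically makes it the slowest of the three by a wide margin.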