
Python Program to Cluster Customers Using sklearn KMeans

To cluster customers in Python, import sklearn's KMeans, fit it on the customer data with model.fit(data), and read each customer's cluster assignment from model.labels_.
📋

Examples

Input: [[5, 3], [10, 15], [24, 10], [30, 45], [85, 70], [71, 80], [60, 78], [55, 52]]
Output: [0 0 0 0 1 1 1 1]

Input: [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]
Output: [1 1 1 0 0 0]

Input: [[0, 0], [0, 0], [0, 0]]
Output: [0 0 0]
🧠

How to Think About It

To cluster customers, first collect their features like age or spending. Then choose a clustering method like KMeans that groups similar customers. Fit the model on the data, and it will assign each customer to a cluster based on feature similarity.
📐

Algorithm

1. Collect customer data with relevant features.
2. Choose the number of clusters (k) to group customers.
3. Create a KMeans model with k clusters.
4. Fit the model on the customer data.
5. Get cluster labels for each customer.
6. Use these labels to analyze or segment customers.
💻

Code

sklearn
from sklearn.cluster import KMeans
import numpy as np

# Sample customer data: features like age and spending
data = np.array([[5, 3], [10, 15], [24, 10], [30, 45], [85, 70], [71, 80], [60, 78], [55, 52]])

# Create KMeans with 2 clusters
model = KMeans(n_clusters=2, random_state=42)
model.fit(data)

# Print cluster labels for each customer
print(model.labels_)
Output
[0 0 0 0 1 1 1 1]
🔍

Dry Run

Let's trace how KMeans clusters the 8 customers from the example above.

Step 1: Input Data

data = [[5, 3], [10, 15], [24, 10], [30, 45], [85, 70], [71, 80], [60, 78], [55, 52]]

Step 2: Initialize KMeans

Set n_clusters=2 and random_state=42 for reproducibility.

Step 3: Fit Model

The model finds 2 centers that best group the data points.

Step 4: Assign Clusters

Each customer is assigned a cluster label: [0 0 0 0 1 1 1 1]

Customer Index | Features | Cluster Label
0 | [5, 3]   | 0
1 | [10, 15] | 0
2 | [24, 10] | 0
3 | [30, 45] | 0
4 | [85, 70] | 1
5 | [71, 80] | 1
6 | [60, 78] | 1
7 | [55, 52] | 1
💡

Why This Works

Step 1: Why KMeans?

KMeans groups data points by minimizing distance to cluster centers, making it good for customer segmentation.
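To see the quantity KMeans minimizes, here is a minimal sketch that recomputes the fitted model's inertia_ by hand as the sum of squared distances from each point to its assigned center. It reuses the article's data; the manual_inertia name and the n_init=10 setting are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.array([[5, 3], [10, 15], [24, 10], [30, 45],
                 [85, 70], [71, 80], [60, 78], [55, 52]])

model = KMeans(n_clusters=2, random_state=42, n_init=10)
model.fit(data)

# inertia_ is the within-cluster sum of squared distances,
# the objective KMeans iteratively minimizes
manual_inertia = sum(
    np.sum((point - model.cluster_centers_[label]) ** 2)
    for point, label in zip(data, model.labels_)
)

print(model.inertia_)
print(np.isclose(model.inertia_, manual_inertia))
```

A smaller inertia means customers sit closer to their cluster's center, i.e., tighter groups.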

Step 2: Choosing Number of Clusters

You pick how many groups (clusters) you want; here, 2 groups separate customers into two main types.
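A common way to pick k in practice is the elbow method: fit KMeans for several values of k and look for the point where inertia stops dropping sharply. A sketch on the article's data (the range of k tried and n_init=10 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.array([[5, 3], [10, 15], [24, 10], [30, 45],
                 [85, 70], [71, 80], [60, 78], [55, 52]])

# Record inertia (within-cluster sum of squares) for each k;
# the "elbow" is where the curve flattens
inertias = {}
for k in range(1, 5):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(data)
    inertias[k] = km.inertia_

for k, v in inertias.items():
    print(k, round(v, 1))
```

Inertia always decreases as k grows, so the goal is not the smallest value but the k after which further drops are marginal.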

Step 3: Model Fitting

The model finds centers that best represent each cluster by iteratively updating them.

Step 4: Cluster Labels

Each customer gets a label showing which cluster they belong to, useful for targeted marketing.
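Once labels are available, a boolean mask over labels_ pulls out each segment directly. A small sketch on the article's data (the per-segment summary printed here is just one way to use the labels):

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.array([[5, 3], [10, 15], [24, 10], [30, 45],
                 [85, 70], [71, 80], [60, 78], [55, 52]])

model = KMeans(n_clusters=2, random_state=42, n_init=10).fit(data)

# Select the rows belonging to each cluster and summarize them
for cluster_id in range(2):
    segment = data[model.labels_ == cluster_id]
    print(f"Cluster {cluster_id}: {len(segment)} customers, "
          f"mean features {segment.mean(axis=0)}")
```

The per-cluster means act as a simple profile of each segment, e.g., "low-spend" versus "high-spend" customers.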

🔄

Alternative Approaches

Agglomerative Clustering
sklearn
from sklearn.cluster import AgglomerativeClustering
import numpy as np

data = np.array([[5, 3], [10, 15], [24, 10], [30, 45], [85, 70], [71, 80], [60, 78], [55, 52]])
model = AgglomerativeClustering(n_clusters=2)
labels = model.fit_predict(data)
print(labels)
Hierarchical method that builds clusters step-by-step; good for small datasets but slower on large data.
DBSCAN
sklearn
from sklearn.cluster import DBSCAN
import numpy as np

data = np.array([[5, 3], [10, 15], [24, 10], [30, 45], [85, 70], [71, 80], [60, 78], [55, 52]])
model = DBSCAN(eps=20, min_samples=2)
labels = model.fit_predict(data)
print(labels)
Density-based clustering that finds clusters of any shape and marks outliers; no need to specify cluster count.
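DBSCAN's outlier handling can be seen with a small toy set (the data points and the eps/min_samples values here are made up for illustration): a point far from all others receives the noise label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Three nearby points plus one distant outlier
data = np.array([[1, 1], [1, 2], [2, 1], [50, 50]])

model = DBSCAN(eps=3, min_samples=2)
labels = model.fit_predict(data)

print(labels)  # the isolated point is marked -1 (noise)
```

KMeans, by contrast, has no noise concept and would force the outlier into one of the clusters, dragging that cluster's center toward it.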

Complexity: O(n * k * i * d) time, O(n * d) space

Time Complexity

KMeans runs in O(n * k * i * d) where n is data points, k clusters, i iterations, and d features; iterations and clusters increase time.

Space Complexity

Stores data and cluster centers, so O(n * d) space is needed; no large extra memory is required.

Which Approach is Fastest?

KMeans is generally faster than hierarchical clustering but less flexible than DBSCAN for complex shapes.

Approach | Time | Space | Best For
KMeans | O(n * k * i * d) | O(n * d) | Large datasets with spherical clusters
Agglomerative Clustering | O(n^3) | O(n^2) | Small datasets, hierarchical relationships
DBSCAN | O(n log n) | O(n * d) | Clusters of arbitrary shape, noise detection
💡
Always scale your customer data before clustering for better results.
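For example, scaling with sklearn's StandardScaler before KMeans keeps a large-range feature such as income from dominating a small-range one such as age. The customer data below is made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Unscaled, income (tens of thousands) would swamp age in distances
data = np.array([[25, 40000], [30, 42000],
                 [45, 90000], [50, 95000]], dtype=float)

# Standardize each feature to mean 0 and unit variance
scaled = StandardScaler().fit_transform(data)

model = KMeans(n_clusters=2, random_state=42, n_init=10).fit(scaled)
print(model.labels_)
```

After scaling, both features contribute comparably, and the two young/low-income customers separate cleanly from the two older/high-income ones.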
⚠️
Choosing the wrong number of clusters can lead to poor customer grouping; validate k with a method like the elbow curve or silhouette score.