Python Program to Cluster Customers Using sklearn KMeans
This article shows how to cluster customers in Python using scikit-learn's KMeans: create a KMeans model, fit it on customer data with model.fit(data), and read the cluster assignments from model.labels_.
Code
```python
from sklearn.cluster import KMeans
import numpy as np

# Sample customer data: features such as age and spending
data = np.array([[5, 3], [10, 15], [24, 10], [30, 45],
                 [85, 70], [71, 80], [60, 78], [55, 52]])

# Create KMeans with 2 clusters
model = KMeans(n_clusters=2, random_state=42)
model.fit(data)

# Print the cluster label for each customer
print(model.labels_)
```
Dry Run
Let's trace how KMeans clusters the 8 customers, each described by two features.
Input Data
data = [[5, 3], [10, 15], [24, 10], [30, 45], [85, 70], [71, 80], [60, 78], [55, 52]]
Initialize KMeans
Set n_clusters=2 and random_state=42 for reproducibility.
Fit Model
Model finds 2 centers that best group the data points.
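After fitting, the learned centers can be inspected via `model.cluster_centers_`. A sketch on the same sample data (the exact center coordinates depend on the converged solution):

```python
from sklearn.cluster import KMeans
import numpy as np

data = np.array([[5, 3], [10, 15], [24, 10], [30, 45],
                 [85, 70], [71, 80], [60, 78], [55, 52]])

model = KMeans(n_clusters=2, random_state=42)
model.fit(data)

# Each row is one cluster center in feature space (e.g. age, spending)
print(model.cluster_centers_)
```

For this data the two centers land near the middle of the low-value group and the high-value group, which is what the label assignments below reflect.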
Assign Clusters
Each customer is assigned a cluster label: [0 0 0 0 1 1 1 1]
| Customer Index | Features | Cluster Label |
|---|---|---|
| 0 | [5, 3] | 0 |
| 1 | [10, 15] | 0 |
| 2 | [24, 10] | 0 |
| 3 | [30, 45] | 0 |
| 4 | [85, 70] | 1 |
| 5 | [71, 80] | 1 |
| 6 | [60, 78] | 1 |
| 7 | [55, 52] | 1 |
Why This Works
Step 1: Why KMeans?
KMeans groups data points by minimizing distance to cluster centers, making it good for customer segmentation.
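The "minimize distance to centers" idea can be made concrete with plain NumPy: given two centers, each point is assigned to the nearer one. The center values below are illustrative, not the fitted ones:

```python
import numpy as np

point = np.array([10, 15])
# Two illustrative cluster centers (hypothetical values)
centers = np.array([[17.0, 18.0], [68.0, 70.0]])

# Euclidean distance from the point to each center
distances = np.linalg.norm(centers - point, axis=1)
nearest = np.argmin(distances)
print(distances, "-> assigned to cluster", nearest)
```

The point is far closer to the first center, so it joins cluster 0; KMeans applies this rule to every point on every iteration.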
Step 2: Choosing Number of Clusters
You pick how many groups (clusters) you want; here, 2 groups separate customers into two main types.
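In practice the number of clusters is often chosen with the elbow method: fit KMeans for several values of k and look for the point where inertia (the within-cluster sum of squared distances) stops dropping sharply. A minimal sketch on the same sample data:

```python
from sklearn.cluster import KMeans
import numpy as np

data = np.array([[5, 3], [10, 15], [24, 10], [30, 45],
                 [85, 70], [71, 80], [60, 78], [55, 52]])

for k in range(1, 5):
    model = KMeans(n_clusters=k, random_state=42).fit(data)
    # inertia_ = sum of squared distances of points to their nearest center
    print(f"k={k}: inertia={model.inertia_:.1f}")
```

The big drop from k=1 to k=2, followed by smaller gains, suggests 2 is a reasonable choice here.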
Step 3: Model Fitting
The model finds centers that best represent each cluster by iteratively updating them.
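The iterative update can be sketched by hand in NumPy: assign each point to its nearest center, then move each center to the mean of its assigned points. This is a simplified single loop, not scikit-learn's full implementation (which also handles smart initialization and empty clusters):

```python
import numpy as np

data = np.array([[5, 3], [10, 15], [24, 10], [30, 45],
                 [85, 70], [71, 80], [60, 78], [55, 52]], dtype=float)

# Start from two arbitrary data points as initial centers
centers = data[[0, 4]].copy()

for _ in range(10):  # a few iterations suffice for this tiny dataset
    # Assignment step: nearest center for every point
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each center moves to the mean of its assigned points
    centers = np.array([data[labels == c].mean(axis=0) for c in range(2)])

print(labels)
print(centers)
```

On this data the assignments stabilize after the first pass, which is exactly when KMeans declares convergence.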
Step 4: Cluster Labels
Each customer gets a label showing which cluster they belong to, useful for targeted marketing.
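Once fitted, the same model can label new customers with `model.predict`, which assigns each new point to its nearest learned center. A sketch using a hypothetical new customer:

```python
from sklearn.cluster import KMeans
import numpy as np

data = np.array([[5, 3], [10, 15], [24, 10], [30, 45],
                 [85, 70], [71, 80], [60, 78], [55, 52]])

model = KMeans(n_clusters=2, random_state=42).fit(data)

# A hypothetical new customer with low feature values
new_customer = np.array([[12, 8]])
print(model.predict(new_customer))
```

The new point falls near the low-value group, so it receives the same label as the first four customers.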
Alternative Approaches
Agglomerative clustering builds clusters bottom-up by repeatedly merging the closest pairs:

```python
from sklearn.cluster import AgglomerativeClustering
import numpy as np

data = np.array([[5, 3], [10, 15], [24, 10], [30, 45],
                 [85, 70], [71, 80], [60, 78], [55, 52]])

model = AgglomerativeClustering(n_clusters=2)
labels = model.fit_predict(data)
print(labels)
```
DBSCAN groups points that are densely packed and marks outliers as noise (label -1), so the number of clusters does not need to be specified in advance:

```python
from sklearn.cluster import DBSCAN
import numpy as np

data = np.array([[5, 3], [10, 15], [24, 10], [30, 45],
                 [85, 70], [71, 80], [60, 78], [55, 52]])

model = DBSCAN(eps=20, min_samples=2)
labels = model.fit_predict(data)
print(labels)
```
Complexity: O(n * k * i * d) time, O(n * d) space
Time Complexity
KMeans runs in O(n * k * i * d), where n is the number of data points, k the number of clusters, i the number of iterations, and d the number of features; runtime grows with more clusters or more iterations.
Space Complexity
KMeans stores the data and the k cluster centers, so O(n * d) space suffices; no large auxiliary memory is required.
Which Approach is Fastest?
KMeans is generally faster than hierarchical clustering but less flexible than DBSCAN for complex shapes.
| Approach | Time | Space | Best For |
|---|---|---|---|
| KMeans | O(n * k * i * d) | O(n * d) | Large datasets with spherical clusters |
| Agglomerative Clustering | O(n^3) | O(n^2) | Small datasets, hierarchical relationships |
| DBSCAN | O(n log n) | O(n * d) | Clusters of arbitrary shape, noise detection |
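The relative speeds in the table can be checked empirically with `time.perf_counter` on a larger synthetic dataset. A rough benchmark sketch (absolute numbers depend on hardware and library versions):

```python
import time
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

rng = np.random.default_rng(0)
data = rng.random((2000, 2)) * 100  # synthetic customer features

for name, model in [("KMeans", KMeans(n_clusters=2, random_state=42)),
                    ("Agglomerative", AgglomerativeClustering(n_clusters=2)),
                    ("DBSCAN", DBSCAN(eps=5, min_samples=5))]:
    start = time.perf_counter()
    model.fit(data)
    print(f"{name}: {time.perf_counter() - start:.3f}s")
```

On datasets much larger than this, agglomerative clustering's cubic cost typically makes it the slowest of the three by a wide margin.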