How to Use KMeans Clustering with sklearn in Python
Use KMeans from sklearn.cluster by creating a model with the desired number of clusters, then fit it to your data with fit(). After fitting, read cluster labels from labels_ or assign new data to clusters with predict().
Syntax
The basic syntax to use KMeans clustering in sklearn is:
- KMeans(n_clusters, random_state): create the KMeans model with the number of clusters you want.
- fit(X): train the model on your data X.
- labels_: access the cluster labels assigned to each data point after fitting.
- predict(X_new): predict cluster labels for new data points.
```python
from sklearn.cluster import KMeans

# Create KMeans model
kmeans = KMeans(n_clusters=3, random_state=42)

# Fit model to data X
kmeans.fit(X)

# Get cluster labels
labels = kmeans.labels_

# Predict clusters for new data
new_labels = kmeans.predict(X_new)
```
Example
This example shows how to cluster simple 2D points into 3 groups using KMeans. It fits the model, prints cluster centers, and shows labels for each point.
```python
from sklearn.cluster import KMeans
import numpy as np

# Sample 2D data points
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0],
              [5, 5], [6, 5], [5, 6]])

# Create and fit KMeans with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)

# Print cluster centers
print('Cluster centers:')
print(kmeans.cluster_centers_)

# Print labels for each point
print('Labels:')
print(kmeans.labels_)
```
Output
Cluster centers:
[[ 1. 2. ]
[10. 2. ]
[ 5.33333333 5.33333333]]
Labels:
[0 0 0 1 1 1 2 2 2]
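Once the model is fitted, predict() assigns new points to the nearest learned cluster center. The sketch below reuses the sample data above; the new points in X_new are made up for illustration, chosen to sit near the first two clusters.

```python
from sklearn.cluster import KMeans
import numpy as np

# Same sample data as in the example above
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0],
              [5, 5], [6, 5], [5, 6]])

# n_init=10 is set explicitly so behavior matches across sklearn versions
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X)

# Hypothetical new points near the first and second clusters
X_new = np.array([[0, 3], [9, 3]])
print(kmeans.predict(X_new))
```

Each new point gets the label of whichever cluster center it is closest to, so [0, 3] lands in the same cluster as [1, 2] and [9, 3] in the same cluster as [10, 2].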
Common Pitfalls
Common mistakes when using KMeans clustering include:
- Not scaling data when features have different units, which can distort clusters.
- Choosing too many or too few clusters without checking results.
- Calling fit() and then predict() on the same data, when fit_predict() does both in one step.
- Ignoring random_state, which can cause different results on each run.
```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import numpy as np

# Wrong: not scaling data
X = np.array([[1, 1000], [2, 1100], [3, 1200]])
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
print('Centers without scaling:', kmeans.cluster_centers_)

# Right: scale data before clustering
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans_scaled = KMeans(n_clusters=2, random_state=0)
kmeans_scaled.fit(X_scaled)
print('Centers with scaling:', kmeans_scaled.cluster_centers_)
```
Output
Centers without scaling: [[ 2. 1100.]
[ 1. 1000.]]
Centers with scaling: [[-1.22474487 -1.22474487]
[ 1.22474487 1.22474487]]
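The fit_predict() shortcut mentioned above can be sketched as follows, using made-up 2D points with two obvious groups; it fits the model and returns the labels in a single call, equivalent to fit(X) followed by reading labels_.

```python
from sklearn.cluster import KMeans
import numpy as np

# Two well-separated groups (illustrative data)
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# fit_predict() fits the model and returns cluster labels in one step
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = kmeans.fit_predict(X)
print(labels)
```

The first three points share one label and the last three share the other, matching what fit() plus labels_ would give.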
Quick Reference
Tips for using KMeans clustering effectively:
- Always set random_state for reproducible results.
- Use n_init (default 10; 'auto' in newer scikit-learn versions) to run KMeans multiple times and keep the best result.
- Scale your data if features vary widely in scale.
- Use the elbow method or silhouette score to choose n_clusters.
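As a rough sketch of choosing n_clusters, the loop below computes inertia_ (for the elbow method) and silhouette_score for several candidate values of k on the sample data from the earlier example; in practice you would plot these and look for the elbow or the highest silhouette.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

# Sample data from the earlier example
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0],
              [5, 5], [6, 5], [5, 6]])

for k in range(2, 6):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X)
    # inertia_: within-cluster sum of squared distances (lower is better);
    # silhouette_score: ranges from -1 to 1 (higher is better)
    print(k, round(km.inertia_, 2), round(silhouette_score(X, labels), 3))
```

Inertia always decreases as k grows, so look for the point where the decrease flattens out (the "elbow") rather than the minimum itself.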
Key Takeaways
Create a KMeans model with the desired number of clusters using KMeans(n_clusters).
Fit the model to your data with fit() and get cluster labels from labels_.
Scale your data before clustering if features have different units or scales.
Set random_state for consistent results across runs.
Use methods like the elbow method to choose the right number of clusters.