
How to Use KMeans Clustering with sklearn in Python

Use KMeans from sklearn.cluster by creating a model with the number of clusters, then fit it to your data using fit(). After fitting, get cluster labels with labels_ or predict new data clusters with predict().
📐

Syntax

The basic syntax to use KMeans clustering in sklearn is:

  • KMeans(n_clusters, random_state): Create the KMeans model with the number of clusters you want.
  • fit(X): Train the model on your data X.
  • labels_: Access the cluster labels assigned to each data point after fitting.
  • predict(X_new): Predict cluster labels for new data points.
python
from sklearn.cluster import KMeans

# Create KMeans model
kmeans = KMeans(n_clusters=3, random_state=42)

# Fit model to data X
kmeans.fit(X)

# Get cluster labels
labels = kmeans.labels_

# Predict clusters for new data
new_labels = kmeans.predict(X_new)
💻

Example

This example shows how to cluster simple 2D points into 3 groups using KMeans. It fits the model, prints cluster centers, and shows labels for each point.

python
from sklearn.cluster import KMeans
import numpy as np

# Sample 2D data points
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0],
              [5, 5], [6, 5], [5, 6]])

# Create and fit KMeans with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)

# Print cluster centers
print('Cluster centers:')
print(kmeans.cluster_centers_)

# Print labels for each point
print('Labels:')
print(kmeans.labels_)
Output
Cluster centers:
[[ 1.          2.        ]
 [10.          2.        ]
 [ 5.33333333  5.33333333]]
Labels:
[0 0 0 1 1 1 2 2 2]
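Once a model like the one above is fitted, predict() assigns new points to the nearest learned center. A quick sketch continuing the example (the label numbers themselves depend on how the run ordered the clusters):

```python
import numpy as np
from sklearn.cluster import KMeans

# Same data as the example above
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0],
              [5, 5], [6, 5], [5, 6]])
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X)

# New points near the first two groups of training data
X_new = np.array([[0, 1], [9, 3]])
new_labels = kmeans.predict(X_new)
print(new_labels)  # one label per new point
```

Here [0, 1] lands in the same cluster as the points around (1, 2), and [9, 3] in the cluster around (10, 2), whatever those clusters happen to be numbered.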
⚠️

Common Pitfalls

Common mistakes when using KMeans clustering include:

  • Not scaling data when features have different units, which can distort clusters.
  • Choosing too many or too few clusters without checking results.
  • Calling fit() and then predict() on the same data, when a single fit_predict() call is simpler.
  • Not setting random_state, which can produce different results on each run.
python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import numpy as np

# Wrong: Not scaling data
X = np.array([[1, 1000], [2, 1100], [3, 1200]])
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(X)
print('Centers without scaling:', kmeans.cluster_centers_)

# Right: Scale data before clustering
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans_scaled = KMeans(n_clusters=2, random_state=0)
kmeans_scaled.fit(X_scaled)
print('Centers with scaling:', kmeans_scaled.cluster_centers_)
Output (one possible run; with only three points and two clusters, the split itself can vary)
Centers without scaling:
[[   1. 1000.]
 [   2.5 1150.]]
Centers with scaling:
[[-1.22474487 -1.22474487]
 [ 0.61237244  0.61237244]]
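The fit_predict() shortcut noted above combines both steps when you only need labels for the training data; a minimal sketch:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [10, 2], [10, 4]])

# fit_predict() fits the model and returns the training labels in one call
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)
print(labels)  # e.g. [0 0 1 1] or [1 1 0 0], depending on initialization
```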
📊

Quick Reference

Tips for using KMeans clustering effectively:

  • Always set random_state for reproducible results.
  • Use n_init to run KMeans with multiple initializations and keep the best result (the default was 10 in older scikit-learn; newer versions default to 'auto').
  • Scale your data if features vary widely in scale.
  • Use the elbow method or silhouette score to choose n_clusters.
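The elbow method and silhouette score from the tips above can be sketched like this, using synthetic blobs so the "right" answer is known (silhouette_score comes from sklearn.metrics; the loop bounds here are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three well-separated 2D blobs of 30 points each
X = np.vstack([rng.normal(loc, 0.3, size=(30, 2))
               for loc in ([0, 0], [5, 5], [0, 5])])

for k in range(2, 6):
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    # inertia_ always drops as k grows (look for the "elbow");
    # silhouette peaks near the true number of clusters
    print(k, round(km.inertia_, 1),
          round(silhouette_score(X, km.labels_), 3))
```

Inertia keeps shrinking no matter how large k gets, so look for the bend where improvements level off; the silhouette score instead peaks at the best k, which makes it easier to read off automatically.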

Key Takeaways

Create a KMeans model with the desired number of clusters using KMeans(n_clusters).
Fit the model to your data with fit() and get cluster labels from labels_.
Scale your data before clustering if features have different units or scales.
Set random_state for consistent results across runs.
Use methods like the elbow method to choose the right number of clusters.
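Putting the takeaways together, scaling and clustering can be chained with scikit-learn's make_pipeline so the scaler is never forgotten; a sketch with made-up data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two features on very different scales, forming two obvious groups
X = np.array([[1, 1000], [2, 1100], [10, 5000], [11, 5100]], dtype=float)

# The pipeline scales the features, then clusters, in one fit_predict call
model = make_pipeline(StandardScaler(),
                      KMeans(n_clusters=2, random_state=0, n_init=10))
labels = model.fit_predict(X)
print(labels)  # points with similar scaled features share a label
```

This keeps the scaling step attached to the model, so predicting on new data automatically applies the same transformation.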