How to Use KMeans Clustering with sklearn in Python
Use KMeans from sklearn.cluster by creating a model with the desired number of clusters, then fit it to your data with fit(). After fitting, read cluster labels from labels_ or assign new data to clusters with predict().
Syntax
The basic syntax to use KMeans clustering in sklearn is:
- KMeans(n_clusters, random_state): create the KMeans model with the number of clusters you want.
- fit(X): train the model on your data X.
- labels_: access the cluster labels assigned to each data point after fitting.
- predict(X_new): predict cluster labels for new data points.
```python
from sklearn.cluster import KMeans

# Create KMeans model
kmeans = KMeans(n_clusters=3, random_state=42)

# Fit model to data X
kmeans.fit(X)

# Get cluster labels
labels = kmeans.labels_

# Predict clusters for new data
new_labels = kmeans.predict(X_new)
```
Example
This example shows how to cluster simple 2D points into 3 groups using KMeans. It fits the model, prints cluster centers, and shows labels for each point.
```python
from sklearn.cluster import KMeans
import numpy as np

# Sample 2D data points
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0],
              [5, 5], [6, 5], [5, 6]])

# Create and fit KMeans with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)

# Print cluster centers
print('Cluster centers:')
print(kmeans.cluster_centers_)

# Print labels for each point
print('Labels:')
print(kmeans.labels_)
```
Output
Cluster centers:
[[ 1. 2. ]
[10. 2. ]
[ 5.33333333 5.33333333]]
Labels:
[0 0 0 1 1 1 2 2 2]
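Once the model is fitted, predict() assigns new points to the nearest learned cluster center. The sketch below reuses the sample data above; the new points in X_new are made up for illustration, chosen to sit near the first two clusters.

```python
from sklearn.cluster import KMeans
import numpy as np

# Same sample data as in the example above
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0],
              [5, 5], [6, 5], [5, 6]])

# n_init=10 is set explicitly so behavior matches across sklearn versions
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X)

# Hypothetical new points near the first and second clusters
X_new = np.array([[0, 3], [9, 3]])
print(kmeans.predict(X_new))
```

Each new point gets the label of whichever cluster center it is closest to, so [0, 3] lands in the same cluster as [1, 2] and [9, 3] in the same cluster as [10, 2].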
Common Pitfalls
Common mistakes when using KMeans clustering include:
- Not scaling data when features have different units, which can distort clusters.
- Choosing too many or too few clusters without checking results.
- Calling fit() and then predict() on the same data, when fit_predict() does both in one step.
- Ignoring random_state, which can cause different results on each run.
```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import numpy as np

# Wrong: not scaling data
X = np.array([[1, 1000], [2, 1100], [3, 1200]])
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
print('Centers without scaling:', kmeans.cluster_centers_)

# Right: scale data before clustering
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans_scaled = KMeans(n_clusters=2, random_state=0)
kmeans_scaled.fit(X_scaled)
print('Centers with scaling:', kmeans_scaled.cluster_centers_)
```
Output
Centers without scaling: [[ 2. 1100.]
[ 1. 1000.]]
Centers with scaling: [[-1.22474487 -1.22474487]
[ 1.22474487 1.22474487]]
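The fit_predict() shortcut mentioned above can be sketched as follows, using made-up 2D points with two obvious groups; it fits the model and returns the labels in a single call, equivalent to fit(X) followed by reading labels_.

```python
from sklearn.cluster import KMeans
import numpy as np

# Two well-separated groups (illustrative data)
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# fit_predict() fits the model and returns cluster labels in one step
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = kmeans.fit_predict(X)
print(labels)
```

The first three points share one label and the last three share the other, matching what fit() plus labels_ would give.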
Quick Reference
Tips for using KMeans clustering effectively:
- Always set random_state for reproducible results.
- Use n_init (default 10; 'auto' in newer scikit-learn versions) to run KMeans multiple times and keep the best result.
- Scale your data if features vary widely in scale.
- Use the elbow method or silhouette score to choose n_clusters.
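As a rough sketch of choosing n_clusters, the loop below computes inertia_ (for the elbow method) and silhouette_score for several candidate values of k on the sample data from the earlier example; in practice you would plot these and look for the elbow or the highest silhouette.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

# Sample data from the earlier example
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0],
              [5, 5], [6, 5], [5, 6]])

for k in range(2, 6):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X)
    # inertia_: within-cluster sum of squared distances (lower is better);
    # silhouette_score: ranges from -1 to 1 (higher is better)
    print(k, round(km.inertia_, 2), round(silhouette_score(X, labels), 3))
```

Inertia always decreases as k grows, so look for the point where the decrease flattens out (the "elbow") rather than the minimum itself.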
Key Takeaways
Create a KMeans model with the desired number of clusters using KMeans(n_clusters).
Fit the model to your data with fit() and get cluster labels from labels_.
Scale your data before clustering if features have different units or scales.
Set random_state for consistent results across runs.
Use methods like the elbow method to choose the right number of clusters.