What Is K Means Clustering in Python


K Means Clustering in Python: What It Is and How to Use It

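K Means clustering is an unsupervised method that groups data points into k clusters based on similarity; in Python it is commonly run with the scikit-learn library (imported as sklearn). It assigns each point to the cluster whose center is nearest, aiming to minimize the total squared distance between points and their cluster centers.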

⚙️

How It Works

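K Means clustering works like sorting a bunch of mixed colored balls into k boxes, where each box holds balls of similar color. It starts by picking k starting points as centers (called centroids). The basic algorithm picks them at random, while scikit-learn by default uses the k-means++ scheme, which tends to spread the starting centers apart. Then it assigns each data point to the nearest center, forming groups.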

Next, it recalculates the centers by averaging all points in each group. This process repeats: reassign points to the nearest center and update centers until the groups stop changing. The goal is to make points in the same group as similar as possible, like friends sitting close together.
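
The loop described above can be sketched in plain NumPy. This is a minimal illustration of the idea, not how scikit-learn implements it: the function name kmeans_sketch, its parameters, and the rule that an empty cluster keeps its old center are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Illustrative k-means loop: assign points, then recompute centers."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)

    for _ in range(n_iters):
        # Step 2: distance from every point to every centroid, shape (n, k)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 3: move each centroid to the mean of its assigned points
        # (assumption: an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return centroids, labels

# Hypothetical toy data: two groups of three points each
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)
centroids, labels = kmeans_sketch(X, k=2)
print(centroids)
print(labels)
```

Because the starting centers are random, a simple loop like this can settle on a poor grouping for some seeds. That is one reason library implementations usually try several starts and keep the best result.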

💻

Example

This example shows how to use KMeans from sklearn.cluster to group simple 2D points into 3 clusters.

python

from sklearn.cluster import KMeans
import numpy as np

# Sample data: 8 points with 2 features each
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0],
              [5, 5], [6, 6]])

# Create KMeans with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42)

# Fit model to data
kmeans.fit(X)

# Print cluster centers
print('Cluster centers:')
print(kmeans.cluster_centers_)

# Print cluster labels for each point
print('Labels:')
print(kmeans.labels_)

Output

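Cluster centers:
[[10.   2. ]
 [ 5.5  5.5]
 [ 1.   2. ]]
Labels:
[2 2 2 0 0 0 1 1]

The number given to each cluster is arbitrary, so the order of the centers and the label values may differ on your machine or with another scikit-learn version. What matters is which points share a label.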
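
Once a model is fitted, it can also score the fit and label new data. The sketch below reuses the same eight points; the two new_points are made up for illustration, and n_init=10 is set explicitly here as an assumption so that the number of restarts does not depend on the installed scikit-learn version.

```python
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0],
              [5, 5], [6, 6]])

# Fit with 10 random restarts and keep the best run
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Inertia: sum of squared distances from each point to its assigned center
print('Inertia:', kmeans.inertia_)

# Assign previously unseen points to the nearest learned center
new_points = np.array([[0, 1], [9, 3]])  # hypothetical new observations
new_labels = kmeans.predict(new_points)
print('New labels:', new_labels)
```

A lower inertia means points sit closer to their centers, but it always drops as k grows, so it is not a quality score on its own.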

🎯

When to Use

Use k means clustering when you want to find natural groups in your data without knowing the groups beforehand. It works well for tasks like customer segmentation, grouping similar documents, or organizing images by features.

It is best for numeric data where clusters are roughly round and similar in size, because each point is assigned by its straight-line distance to a single center. Avoid it if clusters have complex shapes or very different sizes.

✅

Key Points

K Means groups data into k clusters by minimizing distance to cluster centers.
It repeats assigning points and updating centers until stable.
Requires you to choose the number of clusters k in advance.
Works best for simple, numeric, and well-separated clusters.
Implemented in Python with sklearn.cluster.KMeans.
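
Since k has to be chosen up front, one common approach is to fit the model for several values of k and compare them. The sketch below is one possible heuristic, not the only way to pick k: it uses silhouette_score from sklearn.metrics (not part of the example above), and the range of k values and n_init=10 are assumptions chosen for this small dataset.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0],
              [5, 5], [6, 6]])

scores = {}
for k in range(2, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # Silhouette ranges from -1 to 1; higher means tighter, better separated clusters
    scores[k] = silhouette_score(X, model.labels_)
    print(k, round(model.inertia_, 2), round(scores[k], 3))

# Pick the k with the highest silhouette score
best_k = max(scores, key=scores.get)
print('Best k by silhouette:', best_k)
```

Printing inertia next to the silhouette score also lets you look for the "elbow", the point after which adding more clusters reduces inertia only slightly.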

✅

Key Takeaways

K Means clustering groups data points into k clusters based on similarity.

It iteratively assigns points to nearest centers and updates centers until stable.

You must choose the number of clusters k before running the algorithm.

Best for numeric data with roughly round, similar-sized clusters.

Use sklearn's KMeans class for easy implementation in Python.