
K-means via scipy vs scikit-learn - Visual Side-by-Side Comparison

Concept Flow - K-means via scipy vs scikit-learn
1. Start with data points
2. Choose k clusters
3. Initialize centroids
4. Assign points to the nearest centroid
5. Update centroids by averaging the assigned points
6. Check convergence
7. Stop when centroids are stable; otherwise repeat from step 4
K-means groups data into k clusters by repeatedly assigning points to the nearest centroid and updating the centroids until they stop moving.
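The loop above can be sketched directly in NumPy. This is a minimal illustration, not scipy's or scikit-learn's actual implementation; the `kmeans_simple` name and the deterministic choice of initial points are ours, picked so the run is reproducible.

```python
import numpy as np

def kmeans_simple(X, init_centroids, max_iter=100):
    """Minimal sketch of the K-means loop (illustrative only)."""
    centroids = np.asarray(init_centroids, dtype=float)
    for _ in range(max_iter):
        # Step 4: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: update each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(len(centroids))])
        # Steps 6-7: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
# Deterministic init (one point from each group) chosen for reproducibility
centroids, labels = kmeans_simple(X, X[[0, 3]])
print(labels)  # [0 0 0 1 1 1]
```

With real data you would initialize randomly (or with k-means++) and guard against empty clusters; both libraries handle these details for you.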
Execution Sample
import numpy as np
from scipy.cluster.vq import kmeans, vq
from sklearn.cluster import KMeans

# Sample data: two well-separated groups along the x-axis
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)

# SciPy: kmeans() returns the centroids (codebook) and the mean distortion;
# vq() then assigns each point to its nearest centroid
centroids, distortion = kmeans(X, 2)
labels_scipy, _ = vq(X, centroids)

# scikit-learn: fit() runs both steps internally
kmeans_skl = KMeans(n_clusters=2, random_state=0).fit(X)
labels_skl = kmeans_skl.labels_
This code runs K-means clustering on the same data with both scipy and scikit-learn, then extracts the cluster labels from each.
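Because both libraries number their clusters arbitrarily, label 0 in one result may correspond to label 1 in the other. A hedged sketch of comparing the two results as partitions rather than raw label arrays (the `seed` keyword assumes SciPy >= 1.7, and `n_init=10` pins scikit-learn's restart count explicitly):

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)

# SciPy runs best-of-20 restarts by default; seed makes the run repeatable
centroids, _ = kmeans(X, 2, seed=1)
labels_scipy, _ = vq(X, centroids)

# fit_predict() combines fit() and reading labels_
labels_skl = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Cluster IDs are arbitrary (0/1 may be swapped between libraries),
# so compare the induced partitions, not the raw label values
same_partition = (np.array_equal(labels_scipy, labels_skl)
                  or np.array_equal(labels_scipy, 1 - labels_skl))
print(same_partition)  # True
```

The `1 - labels` trick only works for k = 2; for larger k you would match clusters by nearest centroids or use a permutation-invariant score such as adjusted Rand index.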
Execution Table
Step | Action | Scipy centroids | Scipy labels | Sklearn centroids | Sklearn labels
1 | Initialize centroids (random) | [[5.5 3. ] [ 1. 2. ]] | N/A | N/A | N/A
2 | Assign points to nearest centroid | [[5.5 3. ] [ 1. 2. ]] | [1 1 1 0 0 0] | N/A | N/A
3 | Update centroids by averaging assigned points | [[10. 2. ] [ 1. 2. ]] | N/A | N/A | N/A
4 | Assign points to nearest centroid | [[10. 2. ] [ 1. 2. ]] | [1 1 1 0 0 0] | N/A | N/A
5 | Converged (centroids stable) | [[10. 2. ] [ 1. 2. ]] | [1 1 1 0 0 0] | N/A | N/A
6 | Sklearn fit completes | N/A | N/A | [[10. 2. ] [ 1. 2. ]] | [1 1 1 0 0 0]
7 | Output labels | Final centroids | [1 1 1 0 0 0] | Final centroids | [1 1 1 0 0 0]
💡 Both methods converge to the same final centroids here and produce the same partition of the points; only the arbitrary cluster label IDs can differ.
Variable Tracker
Variable | Start | After Step 1 | After Step 3 | After Step 5 | Final
centroids_scipy | None | [[5.5 3. ] [ 1. 2. ]] | [[10. 2. ] [ 1. 2. ]] | [[10. 2. ] [ 1. 2. ]] | [[10. 2. ] [ 1. 2. ]]
labels_scipy | None | N/A | N/A | [1 1 1 0 0 0] | [1 1 1 0 0 0]
centroids_skl | None | N/A | N/A | N/A | [[10. 2. ] [ 1. 2. ]]
labels_skl | None | N/A | N/A | N/A | [1 1 1 0 0 0]
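The update step recorded in the tracker can be checked by hand: averaging the points assigned to each label reproduces the [[10. 2. ] [ 1. 2. ]] centroids. A minimal verification:

```python
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
labels = np.array([1, 1, 1, 0, 0, 0])  # the assignment from step 2 of the table

# Each updated centroid is simply the mean of the points assigned to it
c0 = X[labels == 0].mean(axis=0)  # mean of [10,2], [10,4], [10,0] -> (10, 2)
c1 = X[labels == 1].mean(axis=0)  # mean of [1,2], [1,4], [1,0]   -> (1, 2)
print(c0, c1)
```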
Key Moments - 3 Insights
Why does scipy separate centroid calculation and label assignment into two steps?
Scipy's kmeans function returns centroids only; label assignment is done separately with vq. See execution_table rows 2 and 4 where labels are assigned after centroids update.
Why do both methods produce similar but not identical centroids?
Both start from random initialization, but sklearn defaults to k-means++ seeding and fixes random_state here for reproducibility, while scipy's initial centroids vary between runs unless a seed is passed; this can cause slight differences. See variable_tracker centroids_scipy vs centroids_skl.
Why is sklearn's clustering done in one fit call while scipy requires two functions?
Sklearn's KMeans class combines centroid calculation and label assignment internally in fit(), simplifying usage. Scipy splits these for flexibility. See execution_table steps 6 and 7.
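The same design difference shows up when labeling new data: sklearn exposes predict() on the fitted estimator, while the scipy idiom is to call vq again with a codebook. A small sketch (reusing sklearn's fitted centroids as the scipy codebook is our choice for illustration):

```python
import numpy as np
from scipy.cluster.vq import vq
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

new_points = np.array([[0.0, 1.0], [9.0, 3.0]])

# sklearn: one method call on the fitted estimator
labels_skl = model.predict(new_points)

# scipy style: vq works with any codebook, here sklearn's fitted centroids
labels_vq, _ = vq(new_points, model.cluster_centers_)

print(np.array_equal(labels_skl, labels_vq))  # True
```

Both calls implement the same nearest-centroid rule, which is why the results agree; scipy's split just makes the codebook an explicit, reusable value.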
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution_table at step 2, what cluster labels does scipy assign?
A. [1 0 1 0 1 0]
B. [0 0 0 1 1 1]
C. [1 1 1 0 0 0]
D. [0 1 0 1 0 1]
💡 Hint
Check the 'Scipy labels' column at step 2 in execution_table.
At which step do scipy centroids stop changing?
A. Step 5
B. Step 3
C. Step 1
D. Step 7
💡 Hint
Look at 'Scipy centroids' column in execution_table and see when values repeat.
If we remove random_state in sklearn, what likely changes in the execution_table?
A. Scipy centroids will change instead
B. Sklearn centroids and labels may differ each run
C. Labels remain the same for both methods
D. Execution will stop earlier
💡 Hint
random_state controls reproducibility in sklearn; see variable_tracker centroids_skl.
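The last question can also be checked empirically: with random_state fixed, repeated fits are bit-for-bit identical; without it, initialization varies per run. A brief sketch (`n_init=10` is set explicitly to keep behavior stable across sklearn versions):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Fixed random_state: two independent fits give identical centroids and labels
a = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
b = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(np.array_equal(a.labels_, b.labels_))                  # True
print(np.allclose(a.cluster_centers_, b.cluster_centers_))   # True

# No random_state: initialization differs each run. On well-separated data the
# final partition usually still matches, but cluster IDs and details may not.
c = KMeans(n_clusters=2, n_init=10).fit(X)
```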
Concept Snapshot
K-means groups data into k clusters by:
- Initializing centroids
- Assigning points to nearest centroid
- Updating centroids by averaging
- Repeating until centroids stabilize
Scipy uses kmeans() + vq() separately.
Sklearn uses KMeans.fit() combining both.
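One practical difference not shown above: scipy's vq module documentation recommends whitening (rescaling each feature to unit variance) before calling kmeans, whereas sklearn leaves feature scaling to the user. A hedged sketch of the scipy idiom (the `seed` keyword assumes SciPy >= 1.7):

```python
import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)

# whiten() divides each column by its standard deviation, as the scipy docs advise,
# so no single feature dominates the Euclidean distances
Xw = whiten(X)
centroids, distortion = kmeans(Xw, 2, seed=1)
labels, _ = vq(Xw, centroids)
print(labels)  # the two x-groups still separate; cluster IDs are arbitrary
```

On this toy data whitening does not change the partition, but on features with very different scales it can change which clustering kmeans finds.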
Full Transcript
This visual execution compares K-means clustering using scipy and scikit-learn on the same data. The process starts with data points and choosing k clusters. Scipy's kmeans function calculates centroids, then vq assigns labels. Sklearn's KMeans.fit does both in one step. The execution table shows how centroids and labels update step-by-step until convergence. Variable tracking shows centroid values and labels changing over steps. Key moments clarify why scipy separates steps, why centroids differ slightly, and how sklearn simplifies usage. The quiz tests understanding of labels, convergence step, and effect of random_state. The snapshot summarizes the K-means iterative process and differences between scipy and sklearn usage.