
K-means via scipy vs scikit-learn - Visual Side-by-Side Comparison

Concept Flow - K-means via scipy vs scikit-learn
1. Start with data points
2. Choose k clusters
3. Initialize centroids
4. Assign points to the nearest centroid
5. Update centroids by averaging the assigned points
6. Check convergence
7. Stop when centroids are stable; otherwise repeat from step 4
K-means groups data into k clusters by repeatedly assigning points to the nearest centroid and updating the centroids until they stop moving.
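The loop above can be sketched directly in NumPy. This is a minimal illustration, not scipy's or scikit-learn's actual implementation; the `kmeans_simple` name and the deterministic choice of initial points are ours, picked so the run is reproducible.

```python
import numpy as np

def kmeans_simple(X, init_centroids, max_iter=100):
    """Minimal sketch of the K-means loop (illustrative only)."""
    centroids = np.asarray(init_centroids, dtype=float)
    for _ in range(max_iter):
        # Step 4: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: update each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(len(centroids))])
        # Steps 6-7: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
# Deterministic init (one point from each group) chosen for reproducibility
centroids, labels = kmeans_simple(X, X[[0, 3]])
print(labels)  # [0 0 0 1 1 1]
```

With real data you would initialize randomly (or with k-means++) and guard against empty clusters; both libraries handle these details for you.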
Execution Sample
import numpy as np
from scipy.cluster.vq import kmeans, vq
from sklearn.cluster import KMeans

# Sample data: two well-separated groups along the x-axis
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)

# SciPy: kmeans() returns the centroids (codebook) and the mean distortion;
# vq() then assigns each point to its nearest centroid
centroids, distortion = kmeans(X, 2)
labels_scipy, _ = vq(X, centroids)

# scikit-learn: fit() runs both steps internally
kmeans_skl = KMeans(n_clusters=2, random_state=0).fit(X)
labels_skl = kmeans_skl.labels_
This code runs K-means clustering on the same data with both scipy and scikit-learn, then extracts the cluster labels from each.
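Because both libraries number their clusters arbitrarily, label 0 in one result may correspond to label 1 in the other. A hedged sketch of comparing the two results as partitions rather than raw label arrays (the `seed` keyword assumes SciPy >= 1.7, and `n_init=10` pins scikit-learn's restart count explicitly):

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)

# SciPy runs best-of-20 restarts by default; seed makes the run repeatable
centroids, _ = kmeans(X, 2, seed=1)
labels_scipy, _ = vq(X, centroids)

# fit_predict() combines fit() and reading labels_
labels_skl = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Cluster IDs are arbitrary (0/1 may be swapped between libraries),
# so compare the induced partitions, not the raw label values
same_partition = (np.array_equal(labels_scipy, labels_skl)
                  or np.array_equal(labels_scipy, 1 - labels_skl))
print(same_partition)  # True
```

The `1 - labels` trick only works for k = 2; for larger k you would match clusters by nearest centroids or use a permutation-invariant score such as adjusted Rand index.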
Execution Table
Step | Action | Scipy centroids | Scipy labels | Sklearn centroids | Sklearn labels
1 | Initialize centroids (random) | [[5.5 3. ] [ 1. 2. ]] | N/A | N/A | N/A
2 | Assign points to nearest centroid | [[5.5 3. ] [ 1. 2. ]] | [1 1 1 0 0 0] | N/A | N/A
3 | Update centroids by averaging assigned points | [[10. 2. ] [ 1. 2. ]] | N/A | N/A | N/A
4 | Assign points to nearest centroid | [[10. 2. ] [ 1. 2. ]] | [1 1 1 0 0 0] | N/A | N/A
5 | Converged (centroids stable) | [[10. 2. ] [ 1. 2. ]] | [1 1 1 0 0 0] | N/A | N/A
6 | Sklearn fit completes | N/A | N/A | [[10. 2. ] [ 1. 2. ]] | [1 1 1 0 0 0]
7 | Output labels | Final centroids | [1 1 1 0 0 0] | Final centroids | [1 1 1 0 0 0]
💡 Both methods converge to the same final centroids here and produce the same partition of the points; only the arbitrary cluster label IDs can differ.
Variable Tracker
Variable | Start | After Step 1 | After Step 3 | After Step 5 | Final
centroids_scipy | None | [[5.5 3. ] [ 1. 2. ]] | [[10. 2. ] [ 1. 2. ]] | [[10. 2. ] [ 1. 2. ]] | [[10. 2. ] [ 1. 2. ]]
labels_scipy | None | N/A | N/A | [1 1 1 0 0 0] | [1 1 1 0 0 0]
centroids_skl | None | N/A | N/A | N/A | [[10. 2. ] [ 1. 2. ]]
labels_skl | None | N/A | N/A | N/A | [1 1 1 0 0 0]
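The update step recorded in the tracker can be checked by hand: averaging the points assigned to each label reproduces the [[10. 2. ] [ 1. 2. ]] centroids. A minimal verification:

```python
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
labels = np.array([1, 1, 1, 0, 0, 0])  # the assignment from step 2 of the table

# Each updated centroid is simply the mean of the points assigned to it
c0 = X[labels == 0].mean(axis=0)  # mean of [10,2], [10,4], [10,0] -> (10, 2)
c1 = X[labels == 1].mean(axis=0)  # mean of [1,2], [1,4], [1,0]   -> (1, 2)
print(c0, c1)
```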
Key Moments - 3 Insights
Why does scipy separate centroid calculation and label assignment into two steps?
Scipy's kmeans function returns centroids only; label assignment is done separately with vq. See execution_table rows 2 and 4 where labels are assigned after centroids update.
Why do both methods produce similar but not identical centroids?
Both start from random initialization, but sklearn defaults to k-means++ seeding and fixes random_state here for reproducibility, while scipy's initial centroids vary between runs unless a seed is passed; this can cause slight differences. See variable_tracker centroids_scipy vs centroids_skl.
Why is sklearn's clustering done in one fit call while scipy requires two functions?
Sklearn's KMeans class combines centroid calculation and label assignment internally in fit(), simplifying usage. Scipy splits these for flexibility. See execution_table steps 6 and 7.
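The same design difference shows up when labeling new data: sklearn exposes predict() on the fitted estimator, while the scipy idiom is to call vq again with a codebook. A small sketch (reusing sklearn's fitted centroids as the scipy codebook is our choice for illustration):

```python
import numpy as np
from scipy.cluster.vq import vq
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

new_points = np.array([[0.0, 1.0], [9.0, 3.0]])

# sklearn: one method call on the fitted estimator
labels_skl = model.predict(new_points)

# scipy style: vq works with any codebook, here sklearn's fitted centroids
labels_vq, _ = vq(new_points, model.cluster_centers_)

print(np.array_equal(labels_skl, labels_vq))  # True
```

Both calls implement the same nearest-centroid rule, which is why the results agree; scipy's split just makes the codebook an explicit, reusable value.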
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution_table at step 2, what cluster labels does scipy assign?
A. [1 0 1 0 1 0]
B. [0 0 0 1 1 1]
C. [1 1 1 0 0 0]
D. [0 1 0 1 0 1]
💡 Hint
Check the 'Scipy labels' column at step 2 in execution_table.
At which step do scipy centroids stop changing?
A. Step 5
B. Step 3
C. Step 1
D. Step 7
💡 Hint
Look at 'Scipy centroids' column in execution_table and see when values repeat.
If we remove random_state in sklearn, what likely changes in the execution_table?
A. Scipy centroids will change instead
B. Sklearn centroids and labels may differ each run
C. Labels remain the same for both methods
D. Execution will stop earlier
💡 Hint
random_state controls reproducibility in sklearn; see variable_tracker centroids_skl.
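The last question can also be checked empirically: with random_state fixed, repeated fits are bit-for-bit identical; without it, initialization varies per run. A brief sketch (`n_init=10` is set explicitly to keep behavior stable across sklearn versions):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Fixed random_state: two independent fits give identical centroids and labels
a = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
b = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(np.array_equal(a.labels_, b.labels_))                  # True
print(np.allclose(a.cluster_centers_, b.cluster_centers_))   # True

# No random_state: initialization differs each run. On well-separated data the
# final partition usually still matches, but cluster IDs and details may not.
c = KMeans(n_clusters=2, n_init=10).fit(X)
```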
Concept Snapshot
K-means groups data into k clusters by:
- Initializing centroids
- Assigning points to nearest centroid
- Updating centroids by averaging
- Repeating until centroids stabilize
Scipy uses kmeans() + vq() separately.
Sklearn uses KMeans.fit() combining both.
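One practical difference not shown above: scipy's vq module documentation recommends whitening (rescaling each feature to unit variance) before calling kmeans, whereas sklearn leaves feature scaling to the user. A hedged sketch of the scipy idiom (the `seed` keyword assumes SciPy >= 1.7):

```python
import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)

# whiten() divides each column by its standard deviation, as the scipy docs advise,
# so no single feature dominates the Euclidean distances
Xw = whiten(X)
centroids, distortion = kmeans(Xw, 2, seed=1)
labels, _ = vq(Xw, centroids)
print(labels)  # the two x-groups still separate; cluster IDs are arbitrary
```

On this toy data whitening does not change the partition, but on features with very different scales it can change which clustering kmeans finds.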
Full Transcript
This visual execution compares K-means clustering using scipy and scikit-learn on the same data. The process starts with data points and choosing k clusters. Scipy's kmeans function calculates centroids, then vq assigns labels. Sklearn's KMeans.fit does both in one step. The execution table shows how centroids and labels update step-by-step until convergence. Variable tracking shows centroid values and labels changing over steps. Key moments clarify why scipy separates steps, why centroids differ slightly, and how sklearn simplifies usage. The quiz tests understanding of labels, convergence step, and effect of random_state. The snapshot summarizes the K-means iterative process and differences between scipy and sklearn usage.