
Why clustering groups similar data in SciPy - Visual Breakdown

Concept Flow - Why clustering groups similar data
Start with data points
Calculate distances between points
Group points close to each other
Form clusters of similar points
Output clusters
End
Clustering starts with data points, measures how close they are (for k-means, how close each point is to each cluster center), groups the points that are close, and so forms clusters of similar data.
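As a rough illustration of the "measure how close they are" step, the sketch below computes all pairwise Euclidean distances for the six sample points used later on this page. It uses `scipy.spatial.distance.cdist`; splitting the points into a left group (indices 0 to 2) and a right group (indices 3 to 5) is an assumption made for this example only.

```python
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)

# Euclidean distance between every pair of points (a 6 x 6 matrix).
distances = cdist(points, points)

# Points 0 and 1 share x = 1, so they are close together.
print(distances[0, 1])  # 2.0
# Points 0 and 3 lie in different columns, so they are far apart.
print(distances[0, 3])  # 9.0

# Largest distance inside the left group vs. smallest distance across groups.
within_left = distances[:3, :3].max()
across = distances[:3, 3:].min()
print(within_left, across)  # 4.0 9.0
```

Every distance inside a group is smaller than every distance across groups, which is why these points separate cleanly into two clusters.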
Execution Sample
SciPy
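import numpy as np
from scipy.cluster.vq import kmeans, vq

# Six 2D points stored as floats, forming two groups (x = 1 and x = 10)
points = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
# Find 2 centroids; the second return value (the distortion) is ignored here
centroids, _ = kmeans(points, 2)
# Label each point with the index of its nearest centroid
cluster_labels, _ = vq(points, centroids)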
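This code groups 6 points into 2 clusters using k-means clustering. Because `kmeans` picks its starting centroids at random when given only the number of clusters, the two clusters may come back numbered in either order.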
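The sketch below illustrates that the label numbers themselves carry no meaning. It passes explicit starting centroids to `kmeans` (its second argument accepts either a cluster count or an array of initial centroids). The two guesses, `guess_a` and `guess_b`, are hypothetical values chosen for this example: the same two centers in opposite order.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

points = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)

# Hypothetical starting guesses: identical centers, listed in opposite order.
guess_a = np.array([[1.0, 2.0], [10.0, 2.0]])
guess_b = np.array([[10.0, 2.0], [1.0, 2.0]])

centroids_a, _ = kmeans(points, guess_a)
centroids_b, _ = kmeans(points, guess_b)

labels_a, _ = vq(points, centroids_a)
labels_b, _ = vq(points, centroids_b)

print(labels_a)  # [0 0 0 1 1 1]
print(labels_b)  # [1 1 1 0 0 0]
```

Both runs find the same grouping; only the numbering differs. Passing an explicit guess is one way to make the output repeatable.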
Execution Table
| Step | Action | Details | Result |
|---|---|---|---|
| 1 | Input data points | 6 points in 2D space | [[1,2],[1,4],[1,0],[10,2],[10,4],[10,0]] |
| 2 | Calculate initial centroids | Random or first guess | Centroids approx. [[1,2],[10,2]] |
| 3 | Assign points to nearest centroid | Distance measured | Points 0,1,2 -> cluster 0; Points 3,4,5 -> cluster 1 |
| 4 | Recalculate centroids | Mean of points in each cluster | Centroid 0: [1,2]; Centroid 1: [10,2] |
| 5 | Assign points again | Check if clusters change | Same assignment as step 3 |
| 6 | Converged | Clusters stable | Final clusters formed |
| 7 | Output cluster labels | Each point's cluster | [0,0,0,1,1,1] |
💡 Clusters stable, no change in assignments
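The steps in the execution table can be traced by hand with a small NumPy loop. This is a minimal sketch of the textbook assign-and-update procedure, not SciPy's internal implementation. The starting centroids (points 0 and 3) are an assumption chosen to match step 2 of the table, and the sketch does not handle clusters that end up empty.

```python
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)

# Step 2: assumed starting centroids (points 0 and 3), matching the table.
centroids = points[[0, 3]].copy()
labels = None

for iteration in range(1, 11):
    # Steps 3 and 5: assign each point to its nearest centroid.
    new_labels = cdist(points, centroids).argmin(axis=1)
    # Step 6: stop once the assignment no longer changes.
    if labels is not None and np.array_equal(new_labels, labels):
        break
    labels = new_labels
    # Step 4: move each centroid to the mean of its assigned points.
    centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])

# Step 7: output the cluster labels.
print(labels)     # [0 0 0 1 1 1]
print(iteration)  # 2
```

The loop stops on its second pass because the assignment repeats, and the centroids end at [1, 2] and [10, 2].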
Variable Tracker
| Variable | Start | After Step 3 | After Step 4 | After Step 5 | Final |
|---|---|---|---|---|---|
| points | [[1,2],[1,4],[1,0],[10,2],[10,4],[10,0]] | [[1,2],[1,4],[1,0],[10,2],[10,4],[10,0]] | [[1,2],[1,4],[1,0],[10,2],[10,4],[10,0]] | [[1,2],[1,4],[1,0],[10,2],[10,4],[10,0]] | [[1,2],[1,4],[1,0],[10,2],[10,4],[10,0]] |
| centroids | random or initial guess | [[1,2],[10,2]] | [[1,2],[10,2]] | [[1,2],[10,2]] | [[1,2],[10,2]] |
| cluster_labels | none | [0,0,0,1,1,1] | [0,0,0,1,1,1] | [0,0,0,1,1,1] | [0,0,0,1,1,1] |
Key Moments - 3 Insights
Why do points close to each other get the same cluster label?
Because clustering measures distance and assigns points to the nearest centroid, points close together share the same label as shown in step 3 of the execution table.
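A short sketch of this nearest-centroid assignment, using `scipy.cluster.vq.vq` with the final centroids from the execution table. `vq` returns both the label of each point and its distance to the chosen centroid.

```python
import numpy as np
from scipy.cluster.vq import vq

points = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
# Centroids taken from step 4 of the execution table.
centroids = np.array([[1.0, 2.0], [10.0, 2.0]])

# labels[i] is the index of the centroid nearest to points[i];
# dists[i] is the Euclidean distance to that centroid.
labels, dists = vq(points, centroids)

print(labels)  # [0 0 0 1 1 1]
print(dists)   # [0. 2. 2. 0. 2. 2.]
```

Each point is at most 2 units from its own centroid and at least 9 units from the other one, so nearby points end up with the same label.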
Why do centroids change during clustering?
Centroids update to the mean of points in their cluster to better represent the group, as shown in step 4 where centroids are recalculated.
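In the execution table the centroids happen to start in the right place, so the update step leaves them where they are. The sketch below starts from a deliberately poor guess to make the movement visible. The guess values are hypothetical and chosen for this example only.

```python
import numpy as np
from scipy.cluster.vq import vq

points = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)

# Hypothetical poor starting guess, off to the side of both groups.
centroids = np.array([[0.0, 0.0], [12.0, 5.0]])

# Assign each point to the nearest of the two guessed centroids.
labels, _ = vq(points, centroids)

# Update: each centroid becomes the mean of the points assigned to it.
updated = np.array(
    [points[labels == k].mean(axis=0) for k in range(len(centroids))]
)

print(labels)  # [0 0 0 1 1 1]
```

After one update the centroids move from [0, 0] and [12, 5] to [1, 2] and [10, 2], the centers of the two groups.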
When does the clustering process stop?
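In the textbook algorithm, it stops when cluster assignments do not change between steps, meaning clusters are stable, as shown in step 6. SciPy's `kmeans` applies a related test: it stops once the average distance from points to their centroids (the distortion) changes by no more than `thresh` between iterations.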
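SciPy's `kmeans` exposes its stopping rule through the `thresh` parameter and returns the final distortion, the mean distance from each point to its nearest centroid, as its second value. The sketch below shows both, starting from a hypothetical initial guess so that the run is repeatable.

```python
import numpy as np
from scipy.cluster.vq import kmeans

points = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)

# Hypothetical starting guess; passing an array skips the random initialization.
initial_guess = np.array([[0.0, 0.0], [12.0, 5.0]])

# Iterate until the distortion changes by no more than thresh.
centroids, distortion = kmeans(points, initial_guess, thresh=1e-5)

print(centroids)
print(distortion)
```

For this data the centroids settle at [1, 2] and [10, 2]. The per-point distances are then 0, 2, 2, 0, 2, 2, so the distortion is 8/6, roughly 1.33.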
Visual Quiz - 3 Questions
Test your understanding
Look at the execution table, what cluster label does the point [10,4] get after step 3?
A. 1
B. 0
C. 2
D. None
💡 Hint
Check the cluster assignments in step 3 where points 3,4,5 are assigned cluster 1.
At which step do the centroids get recalculated to better represent clusters?
A. Step 2
B. Step 3
C. Step 4
D. Step 5
💡 Hint
Look at the action 'Recalculate centroids' in step 4 of the execution table.
If the points were all very far apart, how would the cluster labels change?
A. All points get the same cluster label
B. Each point might get its own cluster label
C. Cluster labels would not change
D. Clustering would fail
💡 Hint
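Refer to how clustering groups points by closeness in the Variable Tracker and the Execution Table. Keep in mind that k-means with k = 2 never produces more than two labels, so giving each point its own label would require a larger k.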
Concept Snapshot
Clustering groups data by similarity.
It measures distances between points.
Points close together form clusters.
Centroids represent cluster centers.
Clusters update until stable.
Output labels show group membership.
Full Transcript
Clustering is a way to group data points that are similar or close to each other. We start with data points and measure distances to decide which ones are close. Then we assign each point to the cluster whose center, called a centroid, is nearest. After assigning, we update each centroid to be the average of the points in its cluster. This process repeats until the clusters stop changing. The final output shows which cluster each point belongs to. Grouping similar items together in this way helps reveal patterns in the data.