K-Means Clustering in SciPy - Performance Analysis
We want to understand how the time needed to group similar data grows as we add more data points. In other words: how does the clustering process scale as the dataset gets bigger? Let's analyze the time complexity of this clustering example using SciPy.
```python
from scipy.cluster.vq import kmeans, vq
import numpy as np

# Create 100 random 2-D data points
data = np.random.rand(100, 2)

# Find 3 cluster centers (centroids)
centroids, _ = kmeans(data, 3)

# Assign each point to its nearest center
cluster_labels, _ = vq(data, centroids)
```
This code finds 3 groups in 100 points and assigns each point to the closest group.
Look at what repeats as data grows:
- Primary operation: Calculating distance from each point to each cluster center.
- How many times: For each of the n points, distances to k centers are computed.
As we add more points, the number of distance checks grows.
| Input Size (n) | Approx. Operations (distance checks) |
|---|---|
| 10 | 10 x 3 = 30 |
| 100 | 100 x 3 = 300 |
| 1000 | 1000 x 3 = 3000 |
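We can confirm the counts in the table directly: the number of distance values computed during assignment is exactly the size of the point-to-center distance matrix. A minimal sketch using `scipy.spatial.distance.cdist` (a different SciPy function than `vq`, used here only to make the count visible):

```python
import numpy as np
from scipy.spatial.distance import cdist

k = 3  # the number of cluster centers stays fixed
centroids = np.random.rand(k, 2)

checks = {}
for n in (10, 100, 1000):
    data = np.random.rand(n, 2)
    # cdist computes one distance per (point, center) pair,
    # so the matrix has exactly n * k entries
    dists = cdist(data, centroids)
    checks[n] = dists.size
    print(f"n={n}: {dists.size} distance checks")
```

The counts printed match the table: 30, 300, and 3000 distance checks.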
Pattern observation: operations grow in direct proportion to the number of points.
Time Complexity: O(n) for the assignment step, with the number of clusters k held fixed.
This means the time to assign points to clusters grows linearly as we add more points. (Note that `kmeans` itself repeats this kind of pass over several refinement iterations, but each pass is still linear in n for fixed k.)
[X] Wrong: "Clustering time grows with the square of the number of points because all points compare to each other."
[OK] Correct: Here, each point only compares to a fixed number of cluster centers, not all other points.
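The difference between the wrong and correct mental models is easy to see by comparing the shapes of the two distance matrices. A small sketch (again using `cdist` purely for illustration):

```python
import numpy as np
from scipy.spatial.distance import cdist

n, k = 100, 3
data = np.random.rand(n, 2)
centroids = np.random.rand(k, 2)

# The wrong model: every point compared to every other point
pairwise = cdist(data, data)         # 100 x 100 = 10,000 distances -> O(n^2)

# What cluster assignment actually does: every point vs. k centers
to_centers = cdist(data, centroids)  # 100 x 3 = 300 distances -> O(n * k)

print(pairwise.shape, to_centers.shape)
```

At n = 100 the gap is already 10,000 versus 300 distance computations, and it widens quadratically as n grows.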
Understanding how clustering scales helps you explain your approach clearly and shows you know what affects performance in real tasks.
"What if the number of clusters k also grows with the number of points n? How would the time complexity change?"
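One way to explore this question is to count operations under a hypothetical growth rule for k. The rule k = n / 10 below is an assumption chosen purely for illustration, not something from the example above:

```python
# Hypothetical scenario: k grows with n (here, k = n // 10)
ops = {}
for n in (10, 100, 1000):
    k = max(1, n // 10)   # assumed growth rule, for illustration only
    ops[n] = n * k        # distance checks per assignment pass
    print(f"n={n}, k={k}: {ops[n]} distance checks")
```

Going from n = 100 to n = 1000 (a 10x increase) multiplies the operation count by 100, the signature of quadratic growth: if k scales with n, the n * k assignment cost becomes O(n^2).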