K-means via scipy vs scikit-learn - Performance Comparison
We want to understand how the time it takes to run K-means clustering grows as we use more data points or clusters.
This helps us know how fast or slow the algorithm will be when using scipy compared to scikit-learn.
Let's analyze the time complexity of this K-means clustering code using scipy:
```python
from scipy.cluster.vq import kmeans
import numpy as np

# Generate sample data: 1000 points in 2 dimensions
data = np.random.rand(1000, 2)

# Run K-means clustering with 5 clusters.
# Note: scipy's `iter` is the number of restarts (the best run is kept),
# not a cap on iterations within a single run.
centroids, distortion = kmeans(data, 5, iter=20)
```
This code runs K-means on 1000 points with 5 clusters. One caveat: in scipy's `kmeans`, `iter=20` sets the number of independent restarts (each run iterates until the change in distortion drops below `thresh`), not a per-run iteration cap. For the complexity analysis below, `i` counts the assignment-and-update iterations within a run.
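For comparison, here is a sketch of the equivalent call with scikit-learn's `KMeans` estimator, which does expose the per-run iteration cap directly via `max_iter` (parameter values here mirror the scipy example above):

```python
from sklearn.cluster import KMeans
import numpy as np

# Same sample data: 1000 points in 2 dimensions
data = np.random.rand(1000, 2)

# n_clusters=5 with at most 20 iterations per run; n_init restarts,
# keeping the run with the lowest inertia (sum of squared distances)
km = KMeans(n_clusters=5, max_iter=20, n_init=10, random_state=0)
km.fit(data)

centroids = km.cluster_centers_   # shape (5, 2)
labels = km.labels_               # cluster index for each of the 1000 points
```

Both libraries implement the same underlying algorithm, so the complexity analysis below applies to either.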
Look at what repeats in the algorithm:
- Primary operation: assigning each data point to its nearest cluster center.
- Secondary operation: recomputing each cluster center as the mean of its assigned points.
- How often: both steps repeat every iteration, up to 20 times here.
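Those two repeating steps can be sketched in plain NumPy (a minimal illustration of one iteration, not scipy's or scikit-learn's internal implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((1000, 2))     # n = 1000 points
centers = rng.random((5, 2))       # k = 5 cluster centers

# Distance from every point to every center: an (n, k) matrix,
# i.e. n x k distance computations per iteration
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)

# Assignment step: nearest center for each point
labels = dists.argmin(axis=1)

# Update step: each center becomes the mean of its assigned points
# (keeping the old center if a cluster happens to be empty)
new_centers = np.array([
    points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
    for j in range(5)
])
```

The `(n, k)` distance matrix is exactly where the `n x k` factor in the operation counts below comes from.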
As we increase data points or clusters, the work grows like this:
| Input Size (n points) | Approx. Operations |
|---|---|
| 10 | 10 points x 5 clusters x 20 iterations = 1,000 |
| 100 | 100 x 5 x 20 = 10,000 |
| 1000 | 1000 x 5 x 20 = 100,000 |
Pattern observation: the operation count grows linearly in each factor. Double the points, the clusters, or the iterations, and the work roughly doubles.
Time Complexity: O(n x k x i)
This means the time grows proportionally with the number of points (n), clusters (k), and iterations (i). (Each distance computation also scales with the data dimensionality d, so a fuller bound is O(n x k x i x d); with the fixed 2-D data here, d is a constant.)
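A quick empirical check of the linear growth in n (a rough sketch; absolute timings vary by machine, and restart counts are kept small so the demo runs quickly):

```python
import time
import numpy as np
from scipy.cluster.vq import kmeans

timings = []
for n in (1_000, 10_000, 100_000):
    data = np.random.rand(n, 2)
    start = time.perf_counter()
    kmeans(data, 5, iter=5)   # 5 clusters, 5 restarts
    timings.append(time.perf_counter() - start)
    print(f"n={n:>7}: {timings[-1]:.3f}s")
```

With k and i held fixed, each tenfold increase in n should make the runtime roughly ten times larger.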
[X] Wrong: "The time depends only on the number of data points."
[OK] Correct: The number of clusters and iterations also multiply the work, so ignoring them misses important parts of the cost.
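To see concretely why ignoring k misses part of the cost, count the operations from the O(n x k x i) model with n and i held fixed:

```python
# Operation counts from the O(n x k x i) model, with n and i fixed
n, i = 1000, 20
for k in (2, 5, 10):
    print(f"k={k:>2}: {n * k * i:,} operations")
# k= 2: 40,000 operations
# k= 5: 100,000 operations
# k=10: 200,000 operations
```

The point count never changed, yet the work grew fivefold from k=2 to k=10.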
Understanding how K-means scales helps you explain algorithm choices clearly and shows you can think about performance in real projects.
"What if we reduce the number of iterations by half? How would the time complexity change?"