Flat clustering (fcluster) in SciPy - Time & Space Complexity
When using flat clustering with scipy's fcluster, we want to know how the time to assign clusters grows as we add more data points.
We ask: How does the clustering step scale with the number of points?
Analyze the time complexity of this flat clustering code snippet.
from scipy.cluster.hierarchy import linkage, fcluster
# data: array of n points
Z = linkage(data, method='ward')
clusters = fcluster(Z, t=3, criterion='maxclust')
This code creates a hierarchical clustering and then cuts it to form flat clusters.
Look at the main repeated steps:
- Primary operation: The linkage function computes distances and merges clusters repeatedly.
- How many times: It performs about n-1 merges for n points.
- The fcluster step just traverses the linkage once to assign cluster labels.
As the number of points grows, the linkage step does more work combining pairs.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | ~100 |
| 100 | ~10,000 |
| 1000 | ~1,000,000 |
Pattern observation: The work grows roughly with the square of the number of points.
Time Complexity: O(n^3)
This means if you double the number of points, the time to cluster roughly increases by a factor of eight.
[X] Wrong: "fcluster runs in linear time because it just assigns clusters."
[OK] Correct: While fcluster itself is fast, the main cost is in linkage, which grows much faster as data grows.
Understanding how clustering scales helps you explain choices in data analysis and handle larger datasets confidently.
What if we used a different linkage method that approximates distances? How would the time complexity change?