Hierarchical clustering (linkage) in SciPy - Time & Space Complexity
When using hierarchical clustering with `linkage`, it is important to know how the running time grows as the dataset gets larger — in other words, how the clustering steps scale with the number of data points.
Analyze the time complexity of the following scipy code snippet.
```python
from scipy.cluster.hierarchy import linkage
import numpy as np

# Generate 100 random 2-D data points
X = np.random.rand(100, 2)

# Perform hierarchical clustering using Ward linkage
Z = linkage(X, method='ward')
```
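As a quick sanity check on what `linkage` returns: the linkage matrix `Z` records one merge per row, so for n points it has n − 1 rows and 4 columns (the two merged cluster indices, the merge distance, and the size of the new cluster). A minimal sketch, using a seeded generator for reproducibility:

```python
from scipy.cluster.hierarchy import linkage
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 2))
Z = linkage(X, method='ward')

# One row per merge: n - 1 merges for n points, 4 columns per row.
print(Z.shape)  # (99, 4)

# The final merge produces the root cluster containing all 100 points.
print(int(Z[-1, 3]))  # 100
```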
This code creates 100 points and groups them step-by-step using the Ward method.
Look at what repeats as the algorithm runs.
- Primary operation: Finding the closest pair of clusters to merge. A naive search compares every remaining pair of clusters, which is O(n²) work.
- How many times: Once per merge, so n − 1 times for n points.
Repeating an O(n²) closest-pair search for each of the n − 1 merges gives roughly n³ operations in total, so the work grows quickly as the number of points grows.
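To see why the closest-pair search dominates, here is a sketch of the naive inner loop (an illustration of the brute-force scheme, not SciPy's actual, more optimized implementation): each merge scans every pair of clusters, and that scan repeats for each of the n − 1 merges.

```python
import numpy as np

def naive_closest_pair(centroids):
    """Scan all pairs of cluster centroids: O(k^2) work for k clusters."""
    best = (None, None, np.inf)
    k = len(centroids)
    for i in range(k):
        for j in range(i + 1, k):
            d = np.linalg.norm(centroids[i] - centroids[j])
            if d < best[2]:
                best = (i, j, d)
    return best

# Each merge runs this scan once; with n points there are n - 1 merges,
# so the total work is roughly n * n^2 = n^3 in the naive scheme.
pts = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.1, 0.0])]
i, j, d = naive_closest_pair(pts)
print(i, j, round(d, 2))  # points 0 and 2 are closest, distance 0.1
```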
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | ~1,000 |
| 100 | ~1,000,000 |
| 1000 | ~1,000,000,000 |
Pattern observation: The operations grow roughly with the cube of the input size.
Time Complexity: O(n³) for the naive algorithm analyzed above. (SciPy's `linkage` actually uses the more efficient nearest-neighbor chain algorithm for Ward linkage, which runs in O(n²) time.)
Space Complexity: O(n²), since the pairwise distances between points must be stored.
With the naive O(n³) algorithm, doubling the number of points increases the time needed by roughly a factor of eight.
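To check the growth empirically, you can time `linkage` at increasing input sizes. This is a rough benchmark sketch: absolute times depend on the machine, and SciPy's optimized implementation will scale better than the naive O(n³) analysis suggests.

```python
import time
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(42)
times = []
for n in (200, 400, 800):
    X = rng.random((n, 2))
    t0 = time.perf_counter()
    linkage(X, method='ward')
    elapsed = time.perf_counter() - t0
    times.append(elapsed)
    print(f"n={n:4d}  time={elapsed:.4f}s")
```

Comparing consecutive ratios of these times against the predicted doubling factor is a simple way to see which complexity class the implementation actually follows.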
[X] Wrong: "Hierarchical clustering runs quickly even on very large datasets because it just merges clusters step-by-step."
[OK] Correct: Each merge requires checking many pairs, so the total work grows fast as data grows.
Understanding how clustering time grows helps you justify algorithmic choices in data analysis and shows you can reason clearly about efficiency.
"What if we used a faster method to find closest clusters at each step? How would the time complexity change?"