Cluster evaluation metrics in SciPy - Time & Space Complexity
When we evaluate clusters, we use metrics to check how good a grouping is.
We want to know how the time to calculate these metrics grows as the data size grows.
Let's analyze the time complexity of the following code snippet.
```python
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# data is a 2D array with n samples
D = pdist(data, metric='euclidean')              # condensed pairwise distance matrix
Z = linkage(D, method='ward')                    # hierarchical clustering
labels = fcluster(Z, t=3, criterion='maxclust')  # cluster labels (at most 3 clusters)
```
```python
# Calculate the silhouette score
from scipy.spatial.distance import cdist

silhouette_vals = []
for i in range(len(data)):
    same_cluster = data[labels == labels[i]]
    # a: mean intra-cluster distance, excluding the point's zero self-distance
    a = cdist(data[i:i+1], same_cluster).sum() / max(len(same_cluster) - 1, 1)
    # b: mean distance to the nearest *other* cluster (minimum over clusters,
    # not minimum over individual points)
    b = min(cdist(data[i:i+1], data[labels == k]).mean()
            for k in set(labels) if k != labels[i])
    silhouette_vals.append((b - a) / max(a, b))

silhouette_score = sum(silhouette_vals) / len(silhouette_vals)
```
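The snippet above assumes `data` already exists. As a quick sanity check, here is a self-contained run of the same pipeline on synthetic data (the three Gaussian blobs and their locations are invented for illustration, and `b` is taken as the mean distance to the nearest other cluster, per the standard silhouette definition):

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# three well-separated synthetic blobs (invented example data)
data = np.vstack([rng.normal(loc, 0.5, size=(20, 2)) for loc in (0, 5, 10)])

D = pdist(data, metric='euclidean')                  # pairwise distances
Z = linkage(D, method='ward')                        # hierarchical clustering
labels = fcluster(Z, t=3, criterion='maxclust')      # cluster labels

silhouette_vals = []
for i in range(len(data)):
    same_cluster = data[labels == labels[i]]
    # mean intra-cluster distance, excluding the zero self-distance
    a = cdist(data[i:i+1], same_cluster).sum() / max(len(same_cluster) - 1, 1)
    # mean distance to the nearest other cluster
    b = min(cdist(data[i:i+1], data[labels == k]).mean()
            for k in set(labels) if k != labels[i])
    silhouette_vals.append((b - a) / max(a, b))

silhouette_score = sum(silhouette_vals) / len(silhouette_vals)
print(silhouette_score)  # close to 1 for well-separated blobs
```

On tight, well-separated blobs like these, every point sits far from the other clusters relative to its own, so the score lands near 1.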
This code builds a hierarchical clustering and then computes the silhouette score to evaluate its quality.
To analyze the cost, identify the loops, recursion, and array traversals that repeat.
- Primary operation: Loop over each data point to compute silhouette values.
- How many times: Once per data point, so n times.
- Inside the loop, distance calculations happen over subsets of data, which can be up to size n.
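To see how that count adds up, here is a minimal counting sketch (`count_distance_ops` is a hypothetical helper, not a SciPy function): each of the n loop iterations computes distances from one point to all n points, split between its own cluster and the others.

```python
def count_distance_ops(n):
    """Count pairwise distance computations in the silhouette loop."""
    ops = 0
    for _ in range(n):  # one iteration per data point
        ops += n        # that point is compared against all n points
    return ops

print(count_distance_ops(10))    # 100
print(count_distance_ops(100))   # 10000
```

The totals reproduce the table below: n iterations of n distance computations each gives n * n operations.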
As the number of data points grows, the number of distance calculations grows quickly.
| Input Size (n) | Approx. Distance Calculations |
|---|---|
| 10 | ~100 |
| 100 | ~10,000 |
| 1,000 | ~1,000,000 |
Pattern observation: The operations grow roughly with the square of the input size.
Time Complexity: O(n²) for the silhouette loop (the earlier pdist call also computes all n(n-1)/2 pairwise distances, so the pipeline as a whole is at least quadratic).
This means if you double the data points, the time to compute the silhouette score roughly quadruples.
[X] Wrong: "Calculating cluster evaluation metrics like silhouette score is fast and scales linearly with data size."
[OK] Correct: Because silhouette score requires comparing each point to many others, the number of comparisons grows much faster than the number of points.
Understanding how cluster evaluation metrics scale helps you reason about trade-offs when working with big data and choose the right method for the job.
What if we used a sampling method to calculate silhouette score on only a subset of points? How would the time complexity change?
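One way to answer: compute silhouette values only for m randomly sampled points. Each sampled point is still compared against all n points, so the cost drops from O(n²) to O(m·n), which is linear in n for a fixed sample size. A minimal sketch of that idea (`sampled_silhouette` is a hypothetical helper, not part of SciPy, and the blob data is invented for illustration):

```python
import numpy as np
from scipy.spatial.distance import cdist

def sampled_silhouette(data, labels, m, seed=0):
    """Estimate the silhouette score from m randomly sampled points.

    Each sampled point is still compared to all n points, so the cost
    is O(m * n) instead of O(n^2).
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(data), size=m, replace=False)
    vals = []
    for i in idx:
        same = data[labels == labels[i]]
        # mean intra-cluster distance, excluding the zero self-distance
        a = cdist(data[i:i+1], same).sum() / max(len(same) - 1, 1)
        # mean distance to the nearest other cluster
        b = min(cdist(data[i:i+1], data[labels == k]).mean()
                for k in set(labels) if k != labels[i])
        vals.append((b - a) / max(a, b))
    return sum(vals) / len(vals)

# usage on invented synthetic blobs with known labels
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(loc, 0.5, size=(20, 2)) for loc in (0, 5, 10)])
labels = np.repeat([1, 2, 3], 20)
estimate = sampled_silhouette(data, labels, m=15)
```

The estimate's variance depends on m: a larger sample gives a more reliable score at a higher cost, so sampling trades accuracy for speed.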