
Cluster evaluation metrics in SciPy - Time & Space Complexity

Time Complexity: Cluster evaluation metrics
O(n²)
Understanding Time Complexity

When we evaluate clusters, we use metrics to check how good the grouping is.

We want to know how the time to calculate these metrics grows as data size grows.

Scenario Under Consideration

Analyze the time complexity of the following code snippet.


import numpy as np
from scipy.spatial.distance import pdist, cdist
from scipy.cluster.hierarchy import linkage, fcluster

# data is a 2D array with n samples
D = pdist(data, metric='euclidean')  # pairwise distances
Z = linkage(D, method='ward')        # hierarchical clustering
labels = fcluster(Z, t=3, criterion='maxclust')  # cluster labels

# Calculate silhouette score
silhouette_vals = []
for i in range(len(data)):
    point = data[i:i+1]
    same_cluster = data[labels == labels[i]]
    # a: mean distance to the other points in the same cluster (exclude self)
    a = cdist(point, same_cluster).sum() / max(len(same_cluster) - 1, 1)
    # b: smallest mean distance to the points of any other cluster
    b = min(cdist(point, data[labels == c]).mean()
            for c in np.unique(labels) if c != labels[i])
    silhouette_vals.append((b - a) / max(a, b))

silhouette_score = sum(silhouette_vals) / len(silhouette_vals)

This code computes clusters and then calculates the silhouette score to evaluate clustering quality.

Identify Repeating Operations

Identify the loops, recursion, and array traversals that repeat.

  • Primary operation: Loop over each data point to compute silhouette values.
  • How many times: Once per data point, so n times.
  • Inside the loop: distance calculations run over subsets of the data, which can be up to size n.

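The count can be checked empirically: in each of the n loop iterations, the point is compared against its own cluster plus all other clusters, which together contain all n points. A minimal sketch (the 30-point toy dataset is an assumption for illustration):

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
data = rng.normal(size=(30, 2))  # n = 30 toy sample points

labels = fcluster(linkage(pdist(data), method='ward'),
                  t=3, criterion='maxclust')

# Count pointwise distance computations inside the silhouette loop
n = len(data)
distance_ops = 0
for i in range(n):
    same = data[labels == labels[i]]
    other = data[labels != labels[i]]
    distance_ops += len(same) + len(other)  # cdist computes one distance per row

print(distance_ops)  # → 900, exactly n * n: every point is compared to all n points
```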
How Execution Grows With Input

As the number of data points grows, the number of distance calculations grows quickly.

Input Size (n)    Approx. Operations
10                About 100 distance calculations
100               About 10,000 distance calculations
1000              About 1,000,000 distance calculations

Pattern observation: The operations grow roughly with the square of the input size.

Final Time Complexity

Time Complexity: O(n²)

This means if you double the data points, the time to compute the silhouette score roughly quadruples.
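The doubling rule follows directly from a simple cost model (an illustrative sketch, not a SciPy function): the loop runs n times and each iteration touches up to n points.

```python
# Illustrative cost model for the silhouette loop: n iterations x n points each
def silhouette_ops(n):
    return n * n

print(silhouette_ops(1000) // silhouette_ops(500))  # → 4: doubling n quadruples the work
```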

Common Mistake

[X] Wrong: "Calculating cluster evaluation metrics like silhouette score is fast and scales linearly with data size."

[OK] Correct: Because the silhouette score compares each point against every other point, the number of comparisons grows quadratically, not linearly, with the number of points.

Interview Connect

Understanding how cluster evaluation metrics scale helps you explain trade-offs when working with big data and choosing the right methods.

Self-Check

What if we used a sampling method to calculate silhouette score on only a subset of points? How would the time complexity change?
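As a starting point for thinking this through, here is a hedged sketch (the 200-point toy dataset and the sample size m = 20 are illustrative assumptions, not part of the original code): silhouette values are computed only for a random sample of m points, so the loop performs roughly m × n distance computations instead of n × n.

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))  # toy dataset (assumption for illustration)
labels = fcluster(linkage(pdist(data), method='ward'),
                  t=3, criterion='maxclust')

m = 20  # sample size (illustrative choice)
sample_idx = rng.choice(len(data), size=m, replace=False)

vals = []
for i in sample_idx:  # only m iterations instead of n
    point = data[i:i+1]
    same = data[labels == labels[i]]
    a = cdist(point, same).sum() / max(len(same) - 1, 1)
    b = min(cdist(point, data[labels == c]).mean()
            for c in np.unique(labels) if c != labels[i])
    vals.append((b - a) / max(a, b))

sampled_score = sum(vals) / m  # estimate from m x n distance computations: O(m*n)
```

If m is a fixed constant, the cost becomes linear in n; if m grows with n (say, a fixed fraction), the quadratic behavior returns.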