Cluster evaluation metrics help us check how good our groups (clusters) are. They tell us whether the data points within each cluster are close together and whether different clusters are well separated.
Cluster evaluation metrics in scikit-learn
Introduction
- When you want to see if your clustering grouped similar customers together.
- To check if your clustering of images puts similar images in the same group.
- When comparing different clustering methods to pick the best one.
- To measure how well your clustering matches known labels (if available).
- When tuning clustering settings to improve group quality.
Syntax
```python
from sklearn.metrics import silhouette_score, adjusted_rand_score, davies_bouldin_score

# silhouette_score(X, labels)
# adjusted_rand_score(true_labels, predicted_labels)
# davies_bouldin_score(X, labels)
```
These functions need your data points (X) and cluster labels.
Some metrics need true labels to compare, others work without them.
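To illustrate the two kinds of metrics, here is a minimal sketch on a tiny hand-made dataset (the points and labels below are made up purely for demonstration):

```python
import numpy as np
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Four 2-D points forming two tight, well-separated pairs
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
labels = [0, 0, 1, 1]        # cluster labels from some clustering algorithm
true_labels = [0, 0, 1, 1]   # ground truth, needed only for the external metric

# Internal metric: uses only the data and the predicted labels
print(silhouette_score(X, labels))               # close to 1 for these tight pairs

# External metric: compares the predicted labeling to the ground truth
print(adjusted_rand_score(true_labels, labels))  # 1.0, the groupings agree exactly
```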
Examples
Silhouette score measures how close each point is to its own cluster compared to the nearest other cluster. Values near 1 mean compact, well-separated clusters.
```python
silhouette_score(X, labels)
```
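A common use of the silhouette score is choosing the number of clusters: compute it for several candidate values of k and keep the best. A short sketch, assuming KMeans on synthetic blob data (the parameter values here are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated clusters
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

# Score each candidate cluster count; the highest silhouette wins
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # expected to be 3 for this well-separated data
```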
Adjusted Rand Index compares your clustering to known labels. 1 means a perfect match, values near 0 mean random labeling (it can even be slightly negative for worse-than-random groupings).
```python
adjusted_rand_score(true_labels, predicted_labels)
```
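The Adjusted Rand Index cares only about which points end up grouped together, not about the label names a clustering algorithm happens to assign. A small sketch with made-up labelings:

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]

# Same grouping under different names: ARI ignores the label values
renamed = [2, 2, 0, 0, 1, 1]
print(adjusted_rand_score(true_labels, renamed))    # 1.0

# An unrelated grouping scores low (ARI can even go negative)
unrelated = [0, 1, 2, 0, 1, 2]
print(adjusted_rand_score(true_labels, unrelated))
```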
Davies-Bouldin score measures the average similarity between each cluster and the cluster most like it. Lower values mean more compact clusters with less overlap.
```python
davies_bouldin_score(X, labels)
```
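To see how the score responds to cluster separation, here is a hedged sketch with synthetic Gaussian data (the centers and spreads are made up for illustration): the same labeling scores much lower when the two groups are far apart.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
labels = [0] * 50 + [1] * 50

# Two Gaussian clusters whose centers nearly overlap...
overlapping = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
# ...and two whose centers are far apart
separated = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])

print(davies_bouldin_score(overlapping, labels))  # high: clusters overlap
print(davies_bouldin_score(separated, labels))    # low: compact and far apart
```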
Sample Program
This code creates synthetic data with 3 groups, clusters it with KMeans, and then checks how good the clustering is using all three metrics.
```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score, davies_bouldin_score

# Create sample data with 3 clusters
X, true_labels = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

# Cluster data using KMeans
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
predicted_labels = kmeans.fit_predict(X)

# Calculate metrics
sil_score = silhouette_score(X, predicted_labels)
ari_score = adjusted_rand_score(true_labels, predicted_labels)
db_score = davies_bouldin_score(X, predicted_labels)

print(f"Silhouette Score: {sil_score:.3f}")
print(f"Adjusted Rand Index: {ari_score:.3f}")
print(f"Davies-Bouldin Score: {db_score:.3f}")
```
Important Notes
- Silhouette score ranges from -1 to 1; closer to 1 is better.
- Adjusted Rand Index needs true labels; if they are unknown, use unsupervised metrics like silhouette.
- Davies-Bouldin score is better when smaller; 0 is the best possible value.
Summary
- Cluster evaluation metrics help measure how well your data is grouped.
- Use silhouette score and Davies-Bouldin score when true labels are unknown.
- Use Adjusted Rand Index to compare a clustering against known labels.