Cluster evaluation metrics in ML Python - ML Experiment: Train & Evaluate

Experiment - Cluster evaluation metrics

Problem: You have clustered a dataset using KMeans but are unsure how well the clusters represent the data structure.

Current Metrics: Silhouette Score: 0.45, Davies-Bouldin Index: 1.2

Issue: The current cluster evaluation metrics indicate moderate clustering quality, but it is unclear whether the number of clusters or the clustering method is optimal.

Your Task

Improve the clustering evaluation metrics by adjusting the number of clusters and comparing different metrics to find the best cluster configuration.

You can only change the number of clusters (k) between 2 and 10.

Use KMeans clustering only.

Use silhouette score and Davies-Bouldin index for evaluation.

Solution

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score
import matplotlib.pyplot as plt

# Generate sample data
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.60, random_state=0)

sil_scores = []
db_scores = []
k_values = range(2, 11)

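for k in k_values:
    # n_init is set explicitly so the number of restarts does not depend
    # on the default of the installed scikit-learn version
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)
    # Silhouette score: higher is better; Davies-Bouldin index: lower is better
    sil = silhouette_score(X, labels)
    db = davies_bouldin_score(X, labels)
    sil_scores.append(sil)
    db_scores.append(db)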

# Plotting the scores
plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
plt.plot(k_values, sil_scores, marker='o')
plt.title('Silhouette Score vs Number of Clusters')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Score')

plt.subplot(1,2,2)
plt.plot(k_values, db_scores, marker='o', color='red')
plt.title('Davies-Bouldin Index vs Number of Clusters')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Davies-Bouldin Index')

plt.tight_layout()
plt.show()

# Best k based on silhouette score
best_k_sil = k_values[sil_scores.index(max(sil_scores))]
# Best k based on Davies-Bouldin index
best_k_db = k_values[db_scores.index(min(db_scores))]

print(f'Best k by Silhouette Score: {best_k_sil}')
print(f'Best k by Davies-Bouldin Index: {best_k_db}')

Tested different numbers of clusters from 2 to 10.

Calculated silhouette score and Davies-Bouldin index for each k.

Plotted the scores to visually compare cluster quality.

Identified the best number of clusters based on metrics.

Results Interpretation

Initially, the silhouette score was 0.45 and the Davies-Bouldin index was 1.2, indicating moderate clustering quality.

After testing multiple cluster counts, the best silhouette score improved to 0.70 and the Davies-Bouldin index decreased to 0.45 at k=4 clusters.

A higher silhouette score (range -1 to 1) and a lower Davies-Bouldin index (minimum 0) both indicate compact, well-separated clusters, so comparing the two metrics across values of k helps identify the number of clusters that best represents the data.
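To make the silhouette score less of a black box, the sketch below computes it directly from its definition, s = (b - a) / max(a, b), where a is a point's mean distance to the other members of its own cluster and b is its smallest mean distance to the members of any other cluster. The helper `manual_silhouette` and the toy arrays are illustrative names and data, not part of the original solution. The result is compared against sklearn's `silhouette_score`.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def manual_silhouette(X, labels):
    """Mean silhouette coefficient computed directly from its definition."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            # A point alone in its cluster gets a score of 0 by convention
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Hypothetical toy data: two tight, well-separated groups of three points
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
labels_toy = np.array([0, 0, 0, 1, 1, 1])

manual = manual_silhouette(X_toy, labels_toy)
reference = silhouette_score(X_toy, labels_toy)
print(f"manual: {manual:.4f}, sklearn: {reference:.4f}")
```

Because the two groups are far apart relative to their size, b is much larger than a for every point, which is why the score lands close to 1.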

Bonus Experiment

Try using a different clustering algorithm like Agglomerative Clustering and compare the evaluation metrics with KMeans.

💡 Hint

Use sklearn's AgglomerativeClustering and compute silhouette and Davies-Bouldin scores similarly to compare results.
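One possible way to set up this comparison is sketched below. It reuses the same synthetic data as the solution and evaluates both algorithms for k from 2 to 10. The `evaluate` helper, the `results` dictionary, and the choice of Ward linkage are assumptions made for illustration; other linkages may give different scores.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Same synthetic data as in the KMeans solution
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.60, random_state=0)

def evaluate(model, X):
    """Fit the model and return (silhouette score, Davies-Bouldin index)."""
    labels = model.fit_predict(X)
    return silhouette_score(X, labels), davies_bouldin_score(X, labels)

results = {}
for k in range(2, 11):
    results[("kmeans", k)] = evaluate(
        KMeans(n_clusters=k, n_init=10, random_state=0), X)
    # Ward linkage is an assumed choice; it is sklearn's default
    results[("agglomerative", k)] = evaluate(
        AgglomerativeClustering(n_clusters=k, linkage="ward"), X)

# Print a side-by-side table of both metrics for each algorithm and k
for (name, k), (sil, db) in sorted(results.items()):
    print(f"{name:>13} k={k:<2} silhouette={sil:.3f} davies_bouldin={db:.3f}")
```

If both algorithms peak at the same k on both metrics, that is reasonable evidence the cluster count reflects real structure rather than a quirk of one method.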

Practice

1. (Easy) Which of the following cluster evaluation metrics requires knowing the true labels of the data?

A. Davies-Bouldin Index

B. Silhouette Score

C. Adjusted Rand Index (ARI)

D. Calinski-Harabasz Index

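Solution

Step 1: Understand metric types. Internal metrics (silhouette score, Davies-Bouldin index, Calinski-Harabasz index) are computed from the data and the predicted cluster labels alone. External metrics compare the predicted labels against known true labels.

Step 2: Identify ARI as an external metric. The Adjusted Rand Index measures agreement between the predicted labels and the true labels, so it cannot be computed without ground truth.

Final Answer: C. Adjusted Rand Index (ARI)

Quick Check: ARI is the only option listed that takes true labels as input.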
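The difference between internal and external metrics shows up in the function signatures. In the sketch below, `silhouette_score` takes the data and the predicted labels, while `adjusted_rand_score` takes the true labels and the predicted labels. The synthetic data and variable names are illustrative assumptions. True labels are available here only because `make_blobs` returns them.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# make_blobs returns the generating blob of each point as ground truth
X, y_true = make_blobs(n_samples=500, centers=4, cluster_std=0.60, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Internal metric: needs only the data and the predicted labels
sil = silhouette_score(X, labels)

# External metric: needs the true labels as well
ari = adjusted_rand_score(y_true, labels)

# ARI ignores how clusters are numbered, so renaming them changes nothing
relabeled = (labels + 1) % 4
ari_relabeled = adjusted_rand_score(y_true, relabeled)

print(f"silhouette={sil:.3f} ARI={ari:.3f} ARI after renaming={ari_relabeled:.3f}")
```

In most real clustering tasks no ground truth exists, which is why internal metrics such as the silhouette score and Davies-Bouldin index are used to choose k.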
