
Choosing K (elbow method, silhouette score) in ML Python

Introduction

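K-means needs the number of clusters (K) to be chosen before fitting. With too few clusters, distinct groups get merged; with too many, natural groups get split apart. The elbow method and the silhouette score are two common ways to pick a K that gives clear, useful clusters.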

When you want to group customers into meaningful segments.
When organizing photos into similar groups automatically.
When analyzing patterns in sensor data to find distinct states.
When you need to decide how many groups to use in market research.
When exploring data to find natural groupings without labels.
Syntax
ML Python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Elbow method steps:
# 1. For each K, fit KMeans and get inertia (sum of squared distances from each point to its nearest cluster center).
# 2. Plot K vs inertia to find the 'elbow' point.

# Silhouette score steps:
# 1. For each K >= 2 (the score is undefined for a single cluster), fit KMeans and predict clusters.
# 2. Calculate silhouette_score(X, labels).
# 3. Choose K with highest silhouette score.

The elbow method looks for the K beyond which adding more clusters lowers inertia only slightly.

The silhouette score measures how close each point is to its own cluster compared with the nearest other cluster, averaged over all points.
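
To make the silhouette definition concrete, here is a minimal sketch that computes it by hand with NumPy. For each point, a is the mean distance to the other points in its own cluster, b is the smallest mean distance to any other cluster, and the point's score is (b - a) / max(a, b). The helper name silhouette_by_hand and the four-point dataset are made up for illustration; in practice you would call silhouette_score from scikit-learn.

```python
import numpy as np

def silhouette_by_hand(X, labels):
    """Mean silhouette over all points, computed from the definition.

    Assumes at least two clusters and Euclidean distance. A point that
    is alone in its cluster gets a score of 0 by convention.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            scores.append(0.0)
            continue
        a = dists[i, same].mean()  # mean distance within own cluster
        # b: smallest mean distance to the points of any other cluster
        b = min(dists[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Hypothetical data: two tight, well-separated groups on a line
X_demo = np.array([[0.0], [1.0], [10.0], [11.0]])
good = silhouette_by_hand(X_demo, [0, 0, 1, 1])  # sensible grouping
bad = silhouette_by_hand(X_demo, [0, 1, 0, 1])   # scrambled grouping
print(good, bad)  # roughly 0.90 and -0.45
```

The sensible grouping scores close to 1, while the scrambled grouping scores below 0 because every point sits nearer to the other cluster than to its own.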

Examples
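Fit KMeans with 3 clusters and get inertia. Lower inertia means tighter clusters, but inertia always falls as K grows, so compare it across several K values instead of simply minimizing it.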
ML Python
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
print(kmeans.inertia_)
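As a rough illustration of what the inertia value stands for, the sketch below recomputes it with NumPy: for each point, take the squared distance to the nearest center, then sum over all points. The helper name inertia_by_hand, the four points, and the centers are assumptions chosen so the arithmetic is easy to check; they are not output from KMeans.

```python
import numpy as np

def inertia_by_hand(X, centers):
    """Sum of squared distances from each point to its nearest center."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Squared distance from every point to every center: shape (n_points, n_centers)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

# Hypothetical data: two pairs of points, 10 units apart
X_demo = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

one_center = inertia_by_hand(X_demo, [[5.0, 1.0]])                # 4 * (25 + 1) = 104
two_centers = inertia_by_hand(X_demo, [[0.0, 1.0], [10.0, 1.0]])  # 4 * 1 = 4
print(one_center, two_centers)
```

Going from one center to two cuts the value from 104 to 4, which is the kind of sharp drop the elbow plot is meant to reveal.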
Calculate silhouette score for 4 clusters to check cluster quality.
ML Python
labels = KMeans(n_clusters=4, random_state=42).fit_predict(X)
score = silhouette_score(X, labels)
print(score)
Sample Program

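This code creates 3 groups of points, then fits KMeans for each K from 2 to 6. It prints the inertia and silhouette score for each K and plots both against K. The plots help find the best K by showing the elbow in the inertia curve and the peak in the silhouette curve. Because the sample data has three well-separated groups, both should point to K = 3.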

ML Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Create sample data: 3 groups around centers
np.random.seed(42)
X = np.vstack([
    np.random.normal(loc=0, scale=0.5, size=(50, 2)),
    np.random.normal(loc=5, scale=0.5, size=(50, 2)),
    np.random.normal(loc=10, scale=0.5, size=(50, 2))
])

inertias = []
sil_scores = []
K_range = range(2, 7)

for k in K_range:
    # n_init is set explicitly because its default differs across scikit-learn versions
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X)
    inertias.append(kmeans.inertia_)
    sil_scores.append(silhouette_score(X, labels))

print("Inertia for K=2 to 6:", inertias)
print("Silhouette scores for K=2 to 6:", sil_scores)

# Plotting results
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia')
plt.title('Elbow Method')

plt.subplot(1,2,2)
plt.plot(K_range, sil_scores, 'ro-')
plt.xlabel('Number of clusters K')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Scores')

plt.tight_layout()
plt.show()
Important Notes

The elbow point is where inertia stops decreasing quickly.
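
Reading the elbow off a plot is a judgment call. One possible way to approximate it in code is to pick the K where the inertia curve bends most sharply, measured by the second difference. This is a simple heuristic and an assumption of this sketch, not a standard scikit-learn feature; the helper name and the inertia values below are made up for illustration and are not output from the sample program.

```python
def elbow_by_second_difference(ks, inertias):
    """Return the K where the inertia curve bends most sharply.

    Heuristic: among interior K values, pick the one with the largest
    second difference inertias[i-1] - 2*inertias[i] + inertias[i+1].
    The first and last K can never be chosen, so the range should
    extend at least one step past the candidates on each side.
    """
    if len(ks) != len(inertias) or len(ks) < 3:
        raise ValueError("need matching lists with at least three K values")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

# Made-up inertia values: steep drop up to K=3, then nearly flat
ks_demo = [1, 2, 3, 4, 5, 6]
inertias_demo = [1000.0, 400.0, 60.0, 50.0, 42.0, 36.0]
elbow_k = elbow_by_second_difference(ks_demo, inertias_demo)
print(elbow_k)  # 3
```

A heuristic like this can be thrown off by noisy or smoothly decaying curves, so it is best treated as a second opinion next to the plot.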

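Silhouette score ranges from -1 to 1. Values close to 1 mean compact, well-separated clusters, values near 0 mean overlapping clusters, and negative values suggest points assigned to the wrong cluster.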

Use both methods together for a more reliable choice of K.
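
A small sketch of using the two measures side by side: it prints inertia and silhouette score for each K in one table and reports the K with the highest silhouette score. The synthetic data mirrors the three-group setup of the sample program but is generated here with np.random.default_rng, and n_init=10 is an explicit choice made for this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data (assumption): three groups centered at 0, 5, and 10
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

results = {}  # K -> (inertia, silhouette score)
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(X)
    results[k] = (model.inertia_, silhouette_score(X, labels))

# Show both measures side by side
print(" K   inertia  silhouette")
for k, (inertia, sil) in results.items():
    print(f"{k:2d}  {inertia:8.1f}  {sil:10.3f}")

# K with the highest silhouette score
best_k = max(results, key=lambda k: results[k][1])
print("Best K by silhouette:", best_k)
```

On well-separated data like this, the silhouette peak and the inertia elbow should agree on K = 3. When they disagree on real data, inspecting the clusters for the candidate K values is usually more informative than trusting either number alone.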

Summary

K-means needs K chosen up front, and a good K matches the natural number of groups in the data.

The elbow method looks at inertia to find the point of diminishing returns.

The silhouette score measures how well points fit their own cluster compared with the nearest other cluster.