Choosing the right number of clusters (K) determines how well K-means groups your data: too few clusters merge distinct groups, and too many split natural ones apart.
Choosing K (elbow method, silhouette score) in ML Python
Introduction
When you want to group customers into meaningful segments.
When organizing photos into similar groups automatically.
When analyzing patterns in sensor data to find distinct states.
When you need to decide how many groups to use in market research.
When exploring data to find natural groupings without labels.
Syntax
ML Python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Elbow method steps:
# 1. For each K, fit KMeans and record inertia (sum of squared distances
#    from each point to its cluster center).
# 2. Plot K vs. inertia to find the 'elbow' point.

# Silhouette score steps:
# 1. For each K, fit KMeans and predict cluster labels.
# 2. Calculate silhouette_score(X, labels).
# 3. Choose the K with the highest silhouette score.
The elbow method looks for the point where adding more clusters stops reducing inertia by much.
Silhouette score measures how well each point fits its own cluster compared to the nearest other cluster.
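To see what the silhouette score measures per point, scikit-learn's `silhouette_samples` returns one value for each sample; `silhouette_score` is simply their mean. A minimal sketch on made-up toy data (the two-blob dataset below is an assumption for illustration, not part of this page's examples):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Two well-separated toy groups of points (illustrative data only).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0, scale=0.3, size=(30, 2)),
    rng.normal(loc=4, scale=0.3, size=(30, 2)),
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# silhouette_samples gives one score per point; silhouette_score is their mean.
per_point = silhouette_samples(X, labels)
print("worst-fitting point:", per_point.min())
print("mean silhouette:    ", silhouette_score(X, labels))
```

A low or negative `per_point` value flags an individual point that sits closer to a neighboring cluster than to its own, which the averaged score alone can hide.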
Examples
Fit KMeans with 3 clusters and read off its inertia (lower means tighter clusters, but note that inertia always drops as K grows).
ML Python
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
print(kmeans.inertia_)
Calculate silhouette score for 4 clusters to check cluster quality.
ML Python
labels = KMeans(n_clusters=4, random_state=42).fit_predict(X)
score = silhouette_score(X, labels)
print(score)
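Reading the elbow off a plot is subjective. One rough heuristic (an assumption of this sketch, not something scikit-learn provides) is to pick the K where the relative drop in inertia is largest, i.e. where the improvement collapses afterwards:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: 3 well-separated groups (illustrative, not from this page).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 2)) for c in (0, 5, 10)])

ks = list(range(1, 7))
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

# Heuristic: the elbow is the K with the largest relative inertia drop
# from K-1 to K; after the elbow, each extra cluster helps much less.
drops = [inertias[i - 1] / inertias[i] for i in range(1, len(inertias))]
elbow_k = ks[int(np.argmax(drops)) + 1]
print("elbow at K =", elbow_k)
```

This is only a tiebreaker for ambiguous plots; the visual check in the sample program below remains the standard approach.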
Sample Program
This code creates 3 groups of points, then tries K from 2 to 6 clusters. It prints inertia and silhouette scores for each K. The plots help find the best K by showing the elbow and highest silhouette score.
ML Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Create sample data: 3 groups around centers
np.random.seed(42)
X = np.vstack([
    np.random.normal(loc=0, scale=0.5, size=(50, 2)),
    np.random.normal(loc=5, scale=0.5, size=(50, 2)),
    np.random.normal(loc=10, scale=0.5, size=(50, 2))
])

inertias = []
sil_scores = []
K_range = range(2, 7)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X)
    inertias.append(kmeans.inertia_)
    sil_scores.append(silhouette_score(X, labels))

print("Inertia for K=2 to 6:", inertias)
print("Silhouette scores for K=2 to 6:", sil_scores)

# Plotting results
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia')
plt.title('Elbow Method')

plt.subplot(1, 2, 2)
plt.plot(K_range, sil_scores, 'ro-')
plt.xlabel('Number of clusters K')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Scores')

plt.tight_layout()
plt.show()
Important Notes
The elbow point is where inertia stops decreasing quickly.
Silhouette score ranges from -1 to 1; closer to 1 means better clusters.
Use both methods together for a more reliable decision on K.
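When the two methods must be reduced to a single choice, a common shortcut is to scan the candidate K values and keep the one with the highest silhouette score. A minimal sketch on toy data shaped like the sample program's (the data and K range here are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data with 3 obvious groups, like the sample program above.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

candidates = range(2, 7)
scores = [
    silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X))
    for k in candidates
]

best_k = candidates[int(np.argmax(scores))]  # K with the highest silhouette score
print("best K:", best_k)
```

On clearly separated data this agrees with the visual elbow; on messier data, inspect both plots before trusting the argmax.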
Summary
Choosing K helps find the right number of groups in data.
Elbow method looks at inertia to find a point of diminishing returns.
Silhouette score measures how well points fit their clusters.