How to Use Silhouette Score in Python with sklearn
Use silhouette_score from sklearn.metrics to measure how well clusters are separated in your data. Pass your data and cluster labels to silhouette_score(X, labels) to get a score between -1 and 1, where higher values mean better clustering.
Syntax
The silhouette_score function has this syntax:
silhouette_score(X, labels, metric='euclidean')
Where:
- X is your data array or matrix.
- labels are the cluster labels for each data point.
- metric is the distance metric to use (default is 'euclidean').
python
from sklearn.metrics import silhouette_score

score = silhouette_score(X, labels, metric='euclidean')
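To illustrate the metric parameter, here is a minimal sketch that scores the same labels with three settings: the default Euclidean distance, Manhattan distance, and a precomputed distance matrix. The dataset, its size, and the random seeds are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_score

# Illustrative data and labels (sizes and seeds are arbitrary choices)
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Default metric is Euclidean distance
score_euclidean = silhouette_score(X, labels)

# Any metric name accepted by sklearn's pairwise_distances can be used
score_manhattan = silhouette_score(X, labels, metric="manhattan")

# With metric="precomputed", pass a square distance matrix instead of X
D = pairwise_distances(X, metric="euclidean")
score_precomputed = silhouette_score(D, labels, metric="precomputed")

print(f"euclidean:   {score_euclidean:.3f}")
print(f"manhattan:   {score_manhattan:.3f}")
print(f"precomputed: {score_precomputed:.3f}")
```

The Euclidean and precomputed scores should agree, since both are computed from the same distances. The precomputed form is mainly useful when you already have a distance matrix or need a custom distance.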
Example
This example shows how to cluster data with KMeans and then calculate the silhouette score to check clustering quality.
python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Create sample data with 3 clusters
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

# Fit KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=0)
labels = kmeans.fit_predict(X)

# Calculate silhouette score
score = silhouette_score(X, labels)
print(f'Silhouette Score: {score:.3f}')
Output
Silhouette Score: 0.59
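For intuition about what this number measures: the score is the mean, over all samples, of (b - a) / max(a, b), where a is the mean distance from a point to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below recomputes that by hand and compares it with sklearn. Note that manual_silhouette is a hypothetical helper written for this illustration, not part of scikit-learn, and it assumes Euclidean distance and no duplicate points.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def manual_silhouette(X, labels):
    """Hypothetical helper: mean silhouette computed from the definition."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # a single-point cluster gets a score of 0
        # a: mean distance to the other points in the same cluster
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the points of any other cluster
        b = min(
            dists[i, labels == other].mean()
            for other in np.unique(labels)
            if other != labels[i]
        )
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

X_small = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]]
labels_small = [0, 0, 1, 1, 1]
print(manual_silhouette(X_small, labels_small))
print(silhouette_score(X_small, labels_small))
```

The two printed values should match, which confirms that silhouette_score is an average of per-point values rather than a property of the cluster centers.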
Common Pitfalls
Common mistakes when using silhouette score include:
- Passing cluster labels that do not match the data size.
- Calling silhouette_score when every point has the same label; this raises a ValueError.
- Forgetting that the score is only defined when the number of distinct labels is between 2 and n_samples - 1.
- Using inappropriate distance metrics for your data type.
Always ensure your labels come from a clustering algorithm and match your data points.
python
from sklearn.metrics import silhouette_score

# Wrong: labels length does not match data
X = [[1, 2], [3, 4], [5, 6]]
labels_wrong = [0, 1]  # Only 2 labels for 3 points

# This will raise an error
# silhouette_score(X, labels_wrong)

# Right: labels length matches data
labels_right = [0, 1, 0]
score = silhouette_score(X, labels_right)
print(f'Correct Silhouette Score: {score:.3f}')
Output
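Correct Silhouette Score: -0.333
The call succeeds because the number of labels matches the number of data points. The score is negative because these hand-picked labels put the two most distant points in the same cluster.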
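The length and label-count pitfalls can be checked before scoring. The following is a minimal sketch of such a guard; safe_silhouette is a hypothetical helper name, and returning None for an undefined score is a design assumption rather than anything sklearn does.

```python
from sklearn.metrics import silhouette_score

def safe_silhouette(X, labels):
    """Hypothetical helper: return the score, or None when it is undefined."""
    if len(labels) != len(X):
        raise ValueError(f"Got {len(labels)} labels for {len(X)} samples")
    n_clusters = len(set(labels))
    # silhouette_score needs between 2 and n_samples - 1 distinct labels
    if not 2 <= n_clusters <= len(X) - 1:
        return None
    return silhouette_score(X, labels)

X = [[1, 2], [1, 3], [8, 8], [9, 8]]
print(safe_silhouette(X, [0, 0, 1, 1]))  # two tight, distant clusters
print(safe_silhouette(X, [0, 0, 0, 0]))  # None: only one cluster
print(safe_silhouette(X, [0, 1, 2, 3]))  # None: every point is its own cluster
```

Returning None keeps a loop over many candidate clusterings running when one of them collapses to a single cluster. Raising an exception instead would be an equally reasonable choice.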
Quick Reference
Tips for using silhouette score effectively:
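- Score ranges from -1 (points are probably in the wrong cluster) to +1 (dense, well-separated clusters); values near 0 indicate overlapping clusters.
- A higher score means the clusters are better separated.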
- Use to compare different cluster counts.
- Works best with numeric data and Euclidean distance.
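Comparing cluster counts, as suggested in the tips above, can be sketched as a loop over candidate values of k. The blob centers, the range of k, and the KMeans settings below are assumptions chosen so that the three groups are clearly separated.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated blobs (centers chosen by hand for illustration)
centers = [[0, 0], [10, 0], [5, 9]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

# Score each candidate number of clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")

# Pick the k with the highest mean silhouette
best_k = max(scores, key=scores.get)
print(f"Best k: {best_k}")
```

On data this cleanly separated the highest score should land on k=3. On real data the peak is often less pronounced, so it helps to look at the whole set of scores rather than only the maximum.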
Key Takeaways
- Use silhouette_score(X, labels) from sklearn.metrics to evaluate clustering quality.
- Silhouette score values near 1 mean good separation; values near -1 mean poor clustering.
- Ensure the labels array length matches the number of data points in X.
- Silhouette score helps choose the best number of clusters by comparing scores.
- Use a distance metric that matches your data type for accurate scores.