Hierarchical Clustering with sklearn in Python: How to Use
Use sklearn.cluster.AgglomerativeClustering to perform hierarchical clustering in Python. Initialize it with parameters like n_clusters and linkage, then call fit on your data to get cluster labels.
Syntax
The main class for hierarchical clustering in sklearn is AgglomerativeClustering. Create an instance with parameters such as n_clusters (number of clusters), metric (distance metric), and linkage (how clusters are merged), then call fit or fit_predict on your data.
- n_clusters: Number of clusters to find (default 2).
- metric: Metric to compute linkage ('euclidean' is common).
- linkage: Method to merge clusters ('ward', 'complete', 'average', 'single').
- fit(X): Fits the model to data X.
- fit_predict(X): Fits and returns cluster labels.
python
from sklearn.cluster import AgglomerativeClustering

model = AgglomerativeClustering(n_clusters=3, metric='euclidean', linkage='ward')
model.fit(X)
labels = model.labels_
Example
This example shows how to cluster simple 2D points into 2 groups using hierarchical clustering with sklearn.
python
from sklearn.cluster import AgglomerativeClustering
import numpy as np

# Sample 2D data points
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Create model with 2 clusters and ward linkage
model = AgglomerativeClustering(n_clusters=2, metric='euclidean', linkage='ward')

# Fit model and get cluster labels
labels = model.fit_predict(X)
print('Cluster labels:', labels.tolist())
Output
Cluster labels: [1, 1, 1, 0, 0, 0]
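The full merge tree behind this result can also be inspected. sklearn does not expose a dendrogram directly, but SciPy's scipy.cluster.hierarchy module (an addition beyond this article) computes the same ward merges on the same data; this is a minimal sketch:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Same six points as the example above
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Each row of Z records one merge: (cluster_a, cluster_b, distance, new size)
Z = linkage(X, method='ward')
print(Z.shape)  # (5, 4): n - 1 merges for n = 6 points

# Cutting the tree into 2 clusters reproduces the sklearn grouping
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels.tolist())
```

The linkage matrix Z can be passed to scipy.cluster.hierarchy.dendrogram for plotting the hierarchy.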
Common Pitfalls
- Using deprecated affinity parameter: In sklearn 1.2+, use metric instead of affinity.
- Wrong linkage for metric: ward linkage only works with euclidean distance.
- Not scaling data: Features with different scales can distort clustering results.
- Forgetting to check labels: Always check model.labels_ after fitting.
python
from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.array([[0, 0], [1, 1], [10, 10], [11, 11]])

# Wrong: ward linkage with non-euclidean metric (raises error)
# model = AgglomerativeClustering(n_clusters=2, affinity='manhattan', linkage='ward')

# Correct:
model = AgglomerativeClustering(n_clusters=2, metric='manhattan', linkage='complete')
labels = model.fit_predict(X)
print('Labels:', labels.tolist())
Output
Labels: [1, 1, 0, 0]
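The "not scaling data" pitfall can be sketched concretely. The data below is invented so that the meaningful split lives in the small-range first feature, while the second feature spans a much larger range and dominates raw euclidean distances:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Feature 0 separates points {0, 1} from {2, 3};
# feature 1 is large-scale noise that dominates raw distances
X = np.array([[0.0, 1000.0], [1.0, 2000.0], [10.0, 1500.0], [11.0, 900.0]])

# Without scaling, clustering follows the large second feature
raw = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# After standardization, both features contribute comparably,
# and clustering follows the first feature instead
scaled = AgglomerativeClustering(n_clusters=2).fit_predict(
    StandardScaler().fit_transform(X))

print('Raw:   ', raw.tolist())
print('Scaled:', scaled.tolist())
```

Here the raw run groups points by the second feature's magnitude, while the scaled run recovers the split along the first feature.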
Quick Reference
- n_clusters: Number of clusters to form.
- metric: Distance metric ('euclidean', 'manhattan', etc.).
- linkage: How to merge clusters ('ward', 'complete', 'average', 'single').
- fit(X): Train model on data.
- labels_: Cluster labels after fitting.
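The four linkage options above can be compared side by side; this sketch uses made-up toy data with two well-separated groups, on which all methods happen to agree:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical toy data: two well-separated groups of three points
X = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]])

results = {}
for method in ['ward', 'complete', 'average', 'single']:
    labels = AgglomerativeClustering(n_clusters=2, linkage=method).fit_predict(X)
    results[method] = labels.tolist()
    print(f'{method:>8}: {results[method]}')
```

On cleanly separated data like this, the linkage choice makes no difference; on chained or noisy data the methods diverge, with 'single' prone to chaining clusters together and 'ward' favoring compact, similarly sized clusters.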
Key Takeaways
- Use sklearn.cluster.AgglomerativeClustering with fit or fit_predict to perform hierarchical clustering.
- Choose linkage and metric carefully; ward linkage requires euclidean distance.
- Always check cluster labels with model.labels_ after fitting.
- Scale your data if features have different units or ranges.
- In sklearn 1.2+, use metric instead of the affinity parameter.