MLOps · How-To · Beginner · 4 min read

Hierarchical Clustering with sklearn in Python: How to Use

Use sklearn.cluster.AgglomerativeClustering to perform hierarchical clustering in Python. Initialize it with parameters like n_clusters and linkage, then call fit or fit_predict on your data to get cluster labels.
📐

Syntax

The main class for hierarchical clustering in sklearn is AgglomerativeClustering. You create an instance with parameters like n_clusters (number of clusters), metric (distance metric), and linkage (how clusters are merged). Then use fit or fit_predict on your data.

  • n_clusters: Number of clusters to find (default 2).
  • metric: Distance metric used to compute the linkage ('euclidean' is the most common).
  • linkage: Method to merge clusters ('ward', 'complete', 'average', 'single').
  • fit(X): Fits the model to data X.
  • fit_predict(X): Fits and returns cluster labels.
python
from sklearn.cluster import AgglomerativeClustering

# X is your data: array-like of shape (n_samples, n_features)
model = AgglomerativeClustering(n_clusters=3, metric='euclidean', linkage='ward')
model.fit(X)
labels = model.labels_  # one cluster label per sample, available after fitting
💻

Example

This example shows how to cluster simple 2D points into 2 groups using hierarchical clustering with sklearn.

python
from sklearn.cluster import AgglomerativeClustering
import numpy as np

# Sample 2D data points
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Create model with 2 clusters and ward linkage
model = AgglomerativeClustering(n_clusters=2, metric='euclidean', linkage='ward')

# Fit model and get cluster labels
labels = model.fit_predict(X)

print('Cluster labels:', labels.tolist())
Output
Cluster labels: [1, 1, 1, 0, 0, 0]
⚠️

Common Pitfalls

  • Using deprecated affinity parameter: In sklearn 1.2+, use metric instead of affinity.
  • Wrong linkage for metric: ward linkage only works with euclidean distance.
  • Not scaling data: Features on very different scales can distort distance-based clustering; standardize first (see the scaling sketch after the example below).
  • Forgetting to check labels: Always check model.labels_ after fitting.
python
from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.array([[0, 0], [1, 1], [10, 10], [11, 11]])

# Wrong: deprecated affinity parameter, and ward only supports euclidean distance (raises an error)
# model = AgglomerativeClustering(n_clusters=2, affinity='manhattan', linkage='ward')

# Correct:
model = AgglomerativeClustering(n_clusters=2, metric='manhattan', linkage='complete')
labels = model.fit_predict(X)
print('Labels:', labels.tolist())
Output
Labels: [1, 1, 0, 0]
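
To address the scaling pitfall, here is a minimal sketch that standardizes features with sklearn.preprocessing.StandardScaler before clustering; the data values are made up for illustration, with the two features on very different scales.

python
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
import numpy as np

# Hypothetical data: feature 1 spans a few units, feature 2 spans thousands
X = np.array([[1.0, 5000.0], [1.2, 5100.0],
              [5.0,  100.0], [5.3,  150.0]])

# Standardize each feature to zero mean and unit variance before clustering
X_scaled = StandardScaler().fit_transform(X)

model = AgglomerativeClustering(n_clusters=2, metric='euclidean', linkage='ward')
labels = model.fit_predict(X_scaled)
print('Labels:', labels.tolist())

StandardScaler().fit_transform returns a scaled copy of the data with each column centred and rescaled, so the clustering distances treat both features comparably.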
📊

Quick Reference

  • n_clusters: Number of clusters to form.
  • metric: Distance metric ('euclidean', 'manhattan', etc.).
  • linkage: How to merge clusters ('ward', 'complete', 'average', 'single').
  • fit(X): Train model on data.
  • labels_: Cluster labels after fitting.
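
As a quick illustration of these parameters in action, the sketch below (using made-up 2D points) fits the same data with each linkage strategy; the exact labels you get depend on the data, so treat this as a sketch rather than a reference result.

python
from sklearn.cluster import AgglomerativeClustering
import numpy as np

# Made-up 2D points for illustration
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]])

# Compare how each linkage strategy assigns the same points to 3 clusters
for linkage in ('ward', 'complete', 'average', 'single'):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labels = model.fit_predict(X)
    print(linkage, '->', labels.tolist())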

Key Takeaways

  • Use sklearn.cluster.AgglomerativeClustering with fit or fit_predict to perform hierarchical clustering.
  • Choose linkage and metric carefully; ward linkage requires euclidean distance.
  • Always check the cluster labels in model.labels_ (or the return value of fit_predict) after fitting.
  • Scale your data if features have different units or ranges.
  • In sklearn 1.2+, use the metric parameter instead of affinity.