Agglomerative Clustering in Python with sklearn: What It Is and How to Use It
Agglomerative clustering is a hierarchical clustering method, implemented in Python as sklearn.cluster.AgglomerativeClustering. It builds clusters from the bottom up, joining small groups into bigger ones until the desired number of clusters is reached.
How It Works
Agglomerative clustering is like making a family tree but for data points. Imagine you have many dots on a paper, and you want to group them by closeness. First, each dot is its own group. Then, you find the two closest groups and join them together. You keep doing this step-by-step, joining the nearest groups, until you have just a few big groups left.
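The merging steps described above can be sketched in a few lines of NumPy. This is only an illustration of the idea, not how sklearn implements it: the function name naive_agglomerative is hypothetical, and it assumes "closeness" means the shortest Euclidean distance between any two points of two groups.

```python
import numpy as np

def naive_agglomerative(X, n_clusters):
    """Illustrative sketch: merge the two closest groups until n_clusters remain."""
    # Start with every point in its own group (each group is a list of row indices).
    groups = [[i] for i in range(len(X))]
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    while len(groups) > n_clusters:
        best = (np.inf, None, None)
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                # Assumed closeness rule: distance between the closest pair of members.
                d = dists[np.ix_(groups[a], groups[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Join group b into group a, then drop group b.
        groups[a] = groups[a] + groups[b]
        del groups[b]

    # Turn the remaining groups into one label per point.
    labels = np.empty(len(X), dtype=int)
    for label, members in enumerate(groups):
        labels[members] = label
    return labels

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
labels = naive_agglomerative(X, n_clusters=2)
print(labels)  # [0 0 0 1 1 1]
```

The nested loops compare every pair of groups at every step, so this sketch is only practical for tiny inputs; it is meant to make the procedure concrete, not to replace the library.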
This process is called "bottom-up" because you start with many small groups and build up to bigger ones. The closeness between groups, called the linkage, can be measured in different ways, such as the shortest distance between any two points in the groups (single linkage) or the average distance between their points (average linkage). This method helps find natural clusters in data without needing to guess their shape.
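In sklearn, the way closeness between groups is measured is chosen with the linkage parameter of AgglomerativeClustering. The sketch below tries the four supported values on a small, made-up set of six points; on data this well separated they should all recover the same two groups, although the label numbers themselves may differ.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Made-up sample data: two well-separated columns of points.
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

results = {}
for linkage in ("ward", "complete", "average", "single"):
    # "single" uses the shortest distance between groups, "average" the mean distance,
    # "complete" the largest distance, and "ward" (the default) minimizes variance.
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    results[linkage] = model.fit_predict(X)
    print(linkage, results[linkage])
```

On less tidy data the choice matters more: single linkage tends to follow long chains of nearby points, while ward tends to prefer compact groups of similar size.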
Example
This example shows how to use AgglomerativeClustering from sklearn to group simple 2D points into clusters.
    from sklearn.cluster import AgglomerativeClustering
    import numpy as np

    # Sample data: 6 points in 2D space
    X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

    # Create the clustering model to find 2 clusters
    model = AgglomerativeClustering(n_clusters=2)

    # Fit the model and get cluster labels
    labels = model.fit_predict(X)
    print(labels)
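After fitting, the model keeps a few attributes that are worth inspecting. The snippet below is a small sketch on the same six points; compute_full_tree=True is set explicitly here (an addition to the example above) so that the full merge history is guaranteed to be recorded.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

model = AgglomerativeClustering(n_clusters=2, compute_full_tree=True).fit(X)

# One cluster label per input row (same values fit_predict would return).
print(model.labels_)
# Number of clusters found and number of leaves (original points) in the tree.
print(model.n_clusters_, model.n_leaves_)
# Merge history: each row lists the two nodes joined at that step.
# Indices below n_leaves_ are original points; larger indices are earlier merges.
print(model.children_)
```

The label numbers are arbitrary identifiers: what matters is which points share a label, not whether a group is called 0 or 1.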
When to Use
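Agglomerative clustering is useful when you want to find groups in data without knowing their exact shape or size. It works well for small to medium datasets where you want a clear hierarchy of clusters; because it compares groups pairwise, its time and memory cost grows quickly as the number of points increases.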
Real-world uses include grouping similar documents, customer segmentation in marketing, or organizing images by similarity. It is especially helpful when you want to understand how clusters form step-by-step, as it creates a tree-like structure called a dendrogram (though sklearn's basic class does not plot it directly).
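One common way to get at the dendrogram is SciPy's scipy.cluster.hierarchy module rather than sklearn itself. The sketch below swaps in SciPy's linkage function with the "ward" method (an assumption chosen to mirror sklearn's default) and builds the dendrogram data without drawing it, so matplotlib is not required.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Z records the merge history: one row per merge with the two joined nodes,
# the distance at which they were joined, and the size of the new cluster.
Z = linkage(X, method="ward")
print(Z.shape)  # (5, 4)

# no_plot=True returns the dendrogram layout as a dict instead of drawing it.
tree = dendrogram(Z, no_plot=True)
print(tree["ivl"])  # leaf labels in left-to-right order

# Cut the tree into at most 2 flat clusters (SciPy labels start at 1).
flat = fcluster(Z, t=2, criterion="maxclust")
print(flat)
```

Dropping no_plot=True draws the tree, which does require matplotlib to be installed.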
Key Points
- Agglomerative clustering merges the closest groups step by step, going from many small clusters to fewer big ones.
- It uses distance measures to decide which clusters to join.
- Implemented in Python with sklearn.cluster.AgglomerativeClustering.
- Good for hierarchical grouping and small to medium datasets.
- Produces cluster labels that assign each data point to a cluster.