How to Use DBSCAN in sklearn with Python: Syntax and Example
Use DBSCAN from sklearn.cluster by creating an instance with parameters such as eps and min_samples, then call fit on your data. The fitted model exposes the cluster labels through its labels_ attribute.
Syntax
The basic syntax to use DBSCAN in sklearn is:
- DBSCAN(eps=0.5, min_samples=5, metric='euclidean'): creates the clustering model.
- fit(X): fits the model to the data X.
- labels_: attribute that holds the cluster labels after fitting.
Parameters explained:
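- eps: maximum distance between two samples for one to be considered a neighbor of the other.
- min_samples: number of samples (counting the point itself) that must lie within eps of a point for it to be a core point of a dense region (cluster).
- metric: distance metric to use (default is Euclidean).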
```python
from sklearn.cluster import DBSCAN

# X is assumed to be an array-like of shape (n_samples, n_features)

# Create DBSCAN model
model = DBSCAN(eps=0.5, min_samples=5, metric='euclidean')

# Fit model to data X
model.fit(X)

# Get cluster labels
labels = model.labels_
```
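When only the labels are needed, the fit and labels_ steps can be combined. The sketch below uses fit_predict, which DBSCAN provides in sklearn; the array X is made-up sample data chosen for illustration and is not part of the article's examples.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up sample data: two tight pairs and one isolated point
X = np.array([[0.0, 0.0], [0.0, 0.3], [5.0, 5.0], [5.0, 5.3], [20.0, 20.0]])

# fit_predict fits the model and returns the labels in one call
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)

# Same result as calling fit(X) and then reading labels_
model = DBSCAN(eps=0.5, min_samples=2).fit(X)
same = bool((labels == model.labels_).all())

print(labels)
print("matches fit + labels_:", same)
```

Both forms give identical labels, so the choice is mostly about whether the fitted model object is needed afterwards.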
Example
This example shows how to cluster simple 2D points using DBSCAN and print the cluster labels.
```python
from sklearn.cluster import DBSCAN
import numpy as np

# Sample 2D data points
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

# Create DBSCAN model with eps=3 and min_samples=2
model = DBSCAN(eps=3, min_samples=2)

# Fit the model
model.fit(X)

# Print cluster labels
print(model.labels_)
```
Output
[ 0  0  0  1  1 -1]
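The raw label array is usually summarized before it is used. As a minimal sketch on the same sample points, the snippet below counts clusters and noise points and marks the core points through the core_sample_indices_ attribute of the fitted model. The variable names (n_clusters, n_noise, core_mask) are illustrative choices.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Same sample points as in the example above
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
model = DBSCAN(eps=3, min_samples=2).fit(X)
labels = model.labels_

# Cluster ids are the distinct labels other than -1 (noise)
n_clusters = len(set(labels.tolist()) - {-1})
n_noise = int(np.sum(labels == -1))

# core_sample_indices_ lists the points that have at least
# min_samples neighbors (counting themselves) within eps
core_mask = np.zeros(len(X), dtype=bool)
core_mask[model.core_sample_indices_] = True

print("clusters:", n_clusters)
print("noise points:", n_noise)
print("core points:", X[core_mask].tolist())
```

With min_samples=2, every point that has at least one other point within eps is a core point, so only the isolated point [25, 80] is left out.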
Common Pitfalls
Common mistakes when using DBSCAN include:
- Setting eps too small, causing many points to be labeled as noise (-1).
- Setting min_samples too high, which turns more points into noise and leaves fewer clusters.
- Not scaling data when features have different units, which distorts distance calculations.
- Using inappropriate distance metrics for your data type.
Always try different eps and min_samples values and consider normalizing your data.
```python
from sklearn.cluster import DBSCAN
import numpy as np
from sklearn.preprocessing import StandardScaler

# Data with different scales
X = np.array([[1, 200], [2, 210], [2, 220], [8, 700], [8, 710], [25, 8000]])

# Wrong: no scaling, so eps=0.5 is tiny compared with the second feature
# (same eps and min_samples as below, so only the scaling differs)
model_wrong = DBSCAN(eps=0.5, min_samples=2)
model_wrong.fit(X)
print('Labels without scaling:', model_wrong.labels_)

# Right: scale data first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model_right = DBSCAN(eps=0.5, min_samples=2)
model_right.fit(X_scaled)
print('Labels with scaling:', model_right.labels_)
```
Output
Labels without scaling: [-1 -1 -1 -1 -1 -1]
Labels with scaling: [ 0  0  0  1  1 -1]
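Trying different eps and min_samples values can be done with a small loop. The following is a rough sketch on the scaled data from this section; the candidate values (0.1, 0.5, 1.0 and 2, 3) and the results dictionary are arbitrary choices for illustration, not recommended settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

X = np.array([[1, 200], [2, 210], [2, 220], [8, 700], [8, 710], [25, 8000]])
X_scaled = StandardScaler().fit_transform(X)

# Map each (eps, min_samples) pair to (number of clusters, number of noise points)
results = {}
for eps in (0.1, 0.5, 1.0):
    for min_samples in (2, 3):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_scaled)
        n_clusters = len(set(labels.tolist()) - {-1})
        n_noise = int((labels == -1).sum())
        results[(eps, min_samples)] = (n_clusters, n_noise)
        print(f"eps={eps}, min_samples={min_samples}: "
              f"{n_clusters} clusters, {n_noise} noise points")
```

On this data a small eps with min_samples=3 marks every point as noise, while a large eps merges the first five points into a single cluster, which shows how sensitive the result is to both parameters.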
Quick Reference
Tips for using DBSCAN effectively:
- Choose eps by plotting a k-distance graph or from domain knowledge.
- Set min_samples to 2 or higher (the sklearn default is 5); larger values tolerate more noise but may drop small clusters.
- Scale your data if features have different units.
- DBSCAN labels noise points as -1.
- Works best when clusters have similar density, because a single eps applies to the whole dataset.
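The k-distance idea from the first tip can be sketched with NearestNeighbors from sklearn.neighbors. This is one possible way to compute the values, assuming k is set to the intended min_samples; the points are the sample data from the Example section, and the plotting step is left out to keep the snippet self-contained.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Sample points from the Example section
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

k = 2  # assumption: set to the min_samples value you plan to use

# Querying the training data returns each point itself as its first
# neighbor (distance 0), so the last column is the distance to the
# (k - 1)-th other point
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)

# Sorted k-distances: a sharp jump suggests a candidate range for eps
k_distances = np.sort(distances[:, -1])
print(k_distances)
```

Here the first five values are all 1.0 and the last is far larger, so any eps between those two levels separates the dense points from the outlier. On real data the sorted values are usually plotted and eps is picked near the bend of the curve.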
Key Takeaways
- Create a DBSCAN model with eps and min_samples parameters to control clustering sensitivity.
- Fit the model to your data using fit() and get cluster labels from labels_.
- Scale your data before clustering if features have different units to get meaningful clusters.
- Noise points are labeled as -1 and do not belong to any cluster.
- Experiment with eps and min_samples to find the best clustering for your data.