How to Use DBSCAN in sklearn with Python: Syntax and Example

Import DBSCAN from sklearn.cluster, create an instance with parameters such as eps and min_samples, then call fit on your data. After fitting, the cluster label of each sample is available in the labels_ attribute.

📐

Syntax

The basic syntax to use DBSCAN in sklearn is:

DBSCAN(eps=0.5, min_samples=5, metric='euclidean'): creates the clustering model.
fit(X): fits the model to data X.
labels_: attribute to get cluster labels after fitting.

Parameters explained:

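eps: maximum distance between two samples for one to count as a neighbor of the other. It is a neighborhood radius, not a limit on the size of a cluster.
min_samples: number of samples (including the point itself) that must lie within eps of a point for it to be a core point of a dense region.
metric: distance metric used to measure how far apart samples are (default is 'euclidean').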

python

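import numpy as np
from sklearn.cluster import DBSCAN

# Placeholder data so the snippet runs: 100 random 2D points (replace with your own X)
X = np.random.default_rng(0).normal(size=(100, 2))

# Create DBSCAN model
model = DBSCAN(eps=0.5, min_samples=5, metric='euclidean')

# Fit model to data X
model.fit(X)

# Get cluster labels (one integer per row of X; -1 means noise)
labels = model.labels_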
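Besides fit and labels_, two other members of the fitted estimator are often handy: fit_predict, which fits and returns the labels in one call, and core_sample_indices_, which lists the core points. The following is a minimal sketch on made-up data (the array X here is an assumption chosen for illustration, not part of the original example).

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative data (an assumption): two tight groups and one isolated point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
              [20.0, 20.0]])

model = DBSCAN(eps=0.5, min_samples=3)

# fit_predict fits the model and returns the same array stored in labels_
labels = model.fit_predict(X)

# Indices of the core samples: points with at least min_samples
# samples (counting themselves) within distance eps
core_idx = model.core_sample_indices_

print(labels)
print(core_idx)
```

Each group of three points is dense enough for all of its members to be core samples, while the isolated point has no neighbors within eps and is labeled as noise.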

💻

Example

This example shows how to cluster simple 2D points using DBSCAN and print the cluster labels.

python

from sklearn.cluster import DBSCAN
import numpy as np

# Sample 2D data points
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

# Create DBSCAN model with eps=3 and min_samples=2
model = DBSCAN(eps=3, min_samples=2)

# Fit the model
model.fit(X)

# Print cluster labels
print(model.labels_)

Output

[ 0  0  0  1  1 -1]
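To turn the raw label array into something easier to read, you can count the clusters and noise points and group the samples by label. This is one possible sketch using the same data and parameters as the example; the variable names n_clusters, n_noise, and clusters are illustrative choices.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Same data and parameters as the example above
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
labels = DBSCAN(eps=3, min_samples=2).fit(X).labels_

# -1 marks noise, so it must not be counted as a cluster
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

# Group the points by cluster label, skipping noise
clusters = {int(k): X[labels == k].tolist() for k in set(labels) if k != -1}

print(n_clusters, n_noise)
print(clusters)
```

The first three points form cluster 0, the next two form cluster 1, and the far-away point [25, 80] is the single noise point.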

⚠️

Common Pitfalls

Common mistakes when using DBSCAN include:

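Setting eps too small, causing many points to be labeled as noise (-1), or too large, causing separate clusters to merge into one.
Setting min_samples too high, so that few points qualify as core points and small clusters are absorbed into noise.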
Not scaling data when features have different units, which affects distance calculations.
Using inappropriate distance metrics for your data type.

Always try different eps and min_samples values and consider normalizing your data.

python

from sklearn.cluster import DBSCAN
import numpy as np
from sklearn.preprocessing import StandardScaler

# Data with different scales
X = np.array([[1, 200], [2, 210], [2, 220], [8, 700], [8, 710], [25, 8000]])

# Wrong: no scaling (same eps and min_samples as below, so only scaling differs)
model_wrong = DBSCAN(eps=0.5, min_samples=2)
model_wrong.fit(X)
print('Labels without scaling:', model_wrong.labels_)

# Right: scale data first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model_right = DBSCAN(eps=0.5, min_samples=2)
model_right.fit(X_scaled)
print('Labels with scaling:', model_right.labels_)

Output

Labels without scaling: [-1 -1 -1 -1 -1 -1]
Labels with scaling: [ 0  0  0  1  1 -1]
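The advice to try different eps values can be made concrete with a small sweep. The sketch below reuses the scaled data from this section and records the number of clusters and noise points for a handful of eps values; the particular values tried and the results dictionary are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Same data as above, scaled so both features contribute to the distance
X = np.array([[1, 200], [2, 210], [2, 220], [8, 700], [8, 710], [25, 8000]])
X_scaled = StandardScaler().fit_transform(X)

results = {}
for eps in [0.001, 0.05, 0.5, 1.0, 5.0]:
    labels = DBSCAN(eps=eps, min_samples=2).fit_predict(X_scaled)
    n_clusters = len(set(labels) - {-1})   # ignore the noise label
    n_noise = int(np.sum(labels == -1))
    results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: clusters={n_clusters}, noise={n_noise}")
```

On this data, the smallest eps marks every point as noise, the middle values recover the two groups, and the largest values merge everything into a single cluster. Both failure modes are easy to miss if you only test one setting.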

📊

Quick Reference

Tips for using DBSCAN effectively:

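Choose eps from a k-distance plot (look for the elbow where distances start to rise sharply) or from domain knowledge.
Larger min_samples values make clustering more robust to noise but can absorb small clusters into noise; a common rule of thumb is at least the number of features plus one.
Scale your data if features have different units.
DBSCAN labels noise points as -1.
Works well when clusters have similar density; because one eps applies to the whole dataset, clusters with very different densities are hard to separate.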
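The k-distance idea from the first tip can be sketched with NearestNeighbors from sklearn.neighbors. The synthetic data below is an assumption made only to keep the snippet runnable, and reading the elbow off the sorted distances remains a judgment call rather than an exact rule.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Synthetic data (an assumption): two blobs plus a few scattered points
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.3, size=(50, 2)),
    rng.uniform(low=-3, high=8, size=(5, 2)),
])

min_samples = 5  # the value you plan to pass to DBSCAN

# Querying with the training data returns each point as its own nearest
# neighbor (distance 0), which matches DBSCAN counting the point itself.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)

# Distance from every point to its farthest of the min_samples neighbors, sorted
k_distances = np.sort(distances[:, -1])

# Plot k_distances (for example with matplotlib) and look for the elbow;
# here we only print a few percentiles as a rough summary.
print(np.round(np.percentile(k_distances, [50, 90, 95, 100]), 3))
```

Points inside dense regions have small k-distances, while outliers sit at the steep right-hand end of the sorted curve. A value of eps near the bend is a reasonable starting point to refine by experiment.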

✅

Key Takeaways

Create a DBSCAN model with eps and min_samples parameters to control clustering sensitivity.

Fit the model to your data using fit() and get cluster labels from labels_.

Scale your data before clustering if features have different units to get meaningful clusters.

Noise points are labeled as -1 and do not belong to any cluster.

Experiment with eps and min_samples to find the best clustering for your data.