
How to Use DBSCAN in sklearn with Python: Syntax and Example

Use DBSCAN from sklearn.cluster by creating an instance with parameters like eps and min_samples, then call fit on your data. The model assigns cluster labels accessible via labels_.

Syntax

The basic syntax to use DBSCAN in sklearn is:

  • DBSCAN(eps=0.5, min_samples=5, metric='euclidean'): creates the clustering model.
  • fit(X): fits the model to data X.
  • labels_: attribute to get cluster labels after fitting.

Parameters explained:

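  • eps: maximum distance between two samples for one to count as a neighbor of the other. It is a neighborhood radius, not a limit on how far apart points in the same cluster can be.
  • min_samples: number of samples (including the point itself) that must lie within eps of a point for it to be a core point, that is, part of a dense region.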
  • metric: distance metric to use (default is Euclidean).
python
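import numpy as np
from sklearn.cluster import DBSCAN

# X: array-like of shape (n_samples, n_features); placeholder data for illustration
X = np.array([[1.0, 1.0], [1.1, 1.0], [1.0, 1.1], [1.1, 1.1], [1.05, 1.05], [5.0, 5.0]])

# Create DBSCAN model
model = DBSCAN(eps=0.5, min_samples=5, metric='euclidean')

# Fit model to data X
model.fit(X)

# Get cluster labels (one per row of X, -1 for noise)
labels = model.labels_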
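
As a small extension of the syntax above, the following sketch uses fit_predict, which fits the model and returns the labels in one call, and reads core_sample_indices_ to see which points were treated as core points. The toy array X and the parameter values are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative data (an assumption, not from a real dataset)
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

model = DBSCAN(eps=3, min_samples=2)

# fit_predict(X) is equivalent to fit(X) followed by reading labels_
labels = model.fit_predict(X)

# Indices of the core points found during fitting
core_idx = model.core_sample_indices_

print(labels)
print(core_idx)
```

DBSCAN has no separate predict method for unseen points, so labels are only available for the data passed to fit or fit_predict.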

Example

This example shows how to cluster simple 2D points using DBSCAN and print the cluster labels.

python
from sklearn.cluster import DBSCAN
import numpy as np

# Sample 2D data points
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

# Create DBSCAN model with eps=3 and min_samples=2
model = DBSCAN(eps=3, min_samples=2)

# Fit the model
model.fit(X)

# Print cluster labels
print(model.labels_)
Output
[ 0  0  0  1  1 -1]
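
The labels array is usually a starting point rather than the final result. As a rough sketch, the code below counts the clusters and noise points and pulls out the members of each cluster with a boolean mask; it refits the same toy data so that it runs on its own.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
labels = DBSCAN(eps=3, min_samples=2).fit(X).labels_

# Cluster ids are the non-negative labels; -1 marks noise
cluster_ids = sorted(set(labels.tolist()) - {-1})
n_noise = int(np.sum(labels == -1))

print("clusters:", len(cluster_ids), "noise points:", n_noise)

# Boolean masks select the points that belong to each cluster
for cid in cluster_ids:
    print(cid, X[labels == cid].tolist())
```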

Common Pitfalls

Common mistakes when using DBSCAN include:

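  • Setting eps too small, causing many points to be labeled as noise (-1), or too large, merging separate clusters into one.
  • Setting min_samples too high, so that few points qualify as core points; clusters shrink or disappear and more points become noise.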
  • Not scaling data when features have different units, which affects distance calculations.
  • Using inappropriate distance metrics for your data type.
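
To illustrate the last pitfall, here is a minimal sketch that swaps the default Euclidean metric for metric='cosine', which compares directions rather than magnitudes. The vectors and the eps value are made up for illustration; whether cosine distance suits your data is something to check case by case.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical vectors: the first three point roughly along the x-axis,
# the last two along the y-axis, with very different lengths
X = np.array([[1.0, 0.0], [2.0, 0.0], [10.0, 0.1], [0.0, 1.0], [0.0, 5.0]])

# Cosine distance ignores vector length, so eps lives on a 0..2 scale here
model = DBSCAN(eps=0.1, min_samples=2, metric="cosine")
labels = model.fit_predict(X)

print(labels)  # points are grouped by direction, not by length
```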

Always try different eps and min_samples values and consider normalizing your data.

python
from sklearn.cluster import DBSCAN
import numpy as np
from sklearn.preprocessing import StandardScaler

# Data with different scales
X = np.array([[1, 200], [2, 210], [2, 220], [8, 700], [8, 710], [25, 8000]])

# Wrong: no scaling, so eps=0.5 is tiny relative to the second feature
model_wrong = DBSCAN(eps=0.5, min_samples=2)
model_wrong.fit(X)
print('Labels without scaling:', model_wrong.labels_)

# Right: scale data first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model_right = DBSCAN(eps=0.5, min_samples=2)
model_right.fit(X_scaled)
print('Labels with scaling:', model_right.labels_)
Output
Labels without scaling: [-1 -1 -1 -1 -1 -1]
Labels with scaling: [ 0  0  0  1  1 -1]
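
Because the right eps is rarely obvious up front, one rough way to follow the advice above is to sweep a few values and watch how the number of clusters and noise points changes. The candidate values below are arbitrary assumptions, and the data is the same small array used in the scaling example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

X = np.array([[1, 200], [2, 210], [2, 220], [8, 700], [8, 710], [25, 8000]])
X_scaled = StandardScaler().fit_transform(X)

results = {}
for eps in [0.05, 0.5, 1.0, 5.0]:  # arbitrary candidates to compare
    labels = DBSCAN(eps=eps, min_samples=2).fit_predict(X_scaled)
    n_clusters = len(set(labels.tolist()) - {-1})
    n_noise = int(np.sum(labels == -1))
    results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: clusters={n_clusters}, noise={n_noise}")
```

On this data the smallest eps leaves the most points as noise, while the largest merges everything into a single cluster.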

Quick Reference

Tips for using DBSCAN effectively:

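  • Choose eps from a k-distance graph (look for the elbow where distances rise sharply) or from domain knowledge.
  • min_samples is often set to 2 or higher; larger values make the model treat more points as noise, and a common rule of thumb is roughly twice the number of features.
  • Scale your data if features have different units.
  • DBSCAN labels noise points as -1.
  • Works well for clusters of similar density; a single eps struggles when cluster densities differ widely.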
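
The k-distance idea from the first tip can be sketched with NearestNeighbors: for each point, take the distance to its k-th nearest neighbor, sort those distances, and look for the value where they jump. Setting k equal to min_samples and reusing the toy points from the example are assumptions made here for illustration; in practice you would plot the sorted values.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
k = 2  # assumed to match the min_samples you plan to use

# Querying the training data returns each point as its own first neighbor,
# so the last column is the distance to the (k-1)-th other point
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])

print(np.round(k_distances, 2))
# A sharp jump in the sorted values suggests an eps just below the jump
```

Here five of the six distances are 1.0 and the last is far larger, which is consistent with the eps=3 used in the example.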

Key Takeaways

  • Create a DBSCAN model with eps and min_samples parameters to control clustering sensitivity.
  • Fit the model to your data using fit() and get cluster labels from labels_.
  • Scale your data before clustering if features have different units to get meaningful clusters.
  • Noise points are labeled as -1 and do not belong to any cluster.
  • Experiment with eps and min_samples to find the best clustering for your data.