MLOps · How-To · Beginner · 4 min read

How to Use SMOTE for Imbalanced Data in Python with sklearn

Use SMOTE from imblearn.over_sampling to create synthetic samples for the minority class in an imbalanced dataset. Call fit_resample on your training features and labels; it fits and resamples in one step and returns a balanced dataset ready for model training.
📐

Syntax

The basic syntax to use SMOTE is:

  • SMOTE(): Creates a SMOTE object with optional parameters like sampling_strategy to control the balance.
  • fit_resample(X, y): Fits SMOTE on features X and labels y, then returns the balanced data.
```python
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy='auto', random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
```
💻

Example

This example shows how to use SMOTE to balance an imbalanced dataset, then train a simple logistic regression model and evaluate it with a classification report.

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Create imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.9, 0.1], n_informative=3,
                           n_redundant=1, flip_y=0, n_features=5,
                           n_clusters_per_class=1, n_samples=500, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Apply SMOTE to training data
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

# Train model
model = LogisticRegression(random_state=42)
model.fit(X_train_res, y_train_res)

# Predict and evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```
Output

```
              precision    recall  f1-score   support

           0       0.98      0.99      0.98       134
           1       0.92      0.87      0.89        16

    accuracy                           0.97       150
   macro avg       0.95      0.93      0.94       150
weighted avg       0.97      0.97      0.97       150
```
⚠️

Common Pitfalls

  • Applying SMOTE before splitting data: This causes data leakage and overly optimistic results. Always split first, then apply SMOTE only on training data.
  • Using SMOTE on test data: Never apply SMOTE on test or validation sets; it should only be used on training data.
  • Ignoring class imbalance in evaluation: Check metrics like recall and F1-score, not just accuracy, because accuracy can be misleading on imbalanced data.
```python
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Wrong: Applying SMOTE before split (causes data leakage)
# X_res, y_res = SMOTE().fit_resample(X, y)
# X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.3)

# Right: Split first, then apply SMOTE only on training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
X_train_res, y_train_res = SMOTE().fit_resample(X_train, y_train)
```
📊

Quick Reference

SMOTE Quick Tips:

  • Import from imblearn.over_sampling.
  • Use fit_resample(X_train, y_train) after splitting data.
  • Set sampling_strategy to control minority class size.
  • Use random_state for reproducibility.
  • Evaluate with recall, precision, and F1-score.

Key Takeaways

  • Always apply SMOTE only on training data after splitting to avoid data leakage.
  • Use SMOTE's fit_resample method to generate synthetic minority class samples.
  • Check balanced metrics like recall and F1-score, not just accuracy, on imbalanced data.
  • Set random_state in SMOTE for reproducible results.
  • SMOTE helps improve model performance by balancing class distribution.