How to Use SMOTE for Imbalanced Data in Python with sklearn
Use SMOTE from imblearn.over_sampling to create synthetic samples for the minority class in imbalanced data. Call fit_resample on your training features and labels to get a balanced dataset ready for model training.
Syntax
The basic syntax to use SMOTE is:
- SMOTE(): Creates a SMOTE object with optional parameters like sampling_strategy to control the balance.
- fit_resample(X, y): Fits SMOTE on features X and labels y, then returns the balanced data.
```python
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy='auto', random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
```
Example
This example shows how to use SMOTE to balance an imbalanced dataset, then train a simple logistic regression model and evaluate it with a classification report.
```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Create imbalanced dataset (90% class 0, 10% class 1)
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.9, 0.1],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=5, n_clusters_per_class=1,
                           n_samples=500, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Apply SMOTE to training data only
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

# Train model on the balanced training set
model = LogisticRegression(random_state=42)
model.fit(X_train_res, y_train_res)

# Predict and evaluate on the untouched test set
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```
Output
```
              precision    recall  f1-score   support

           0       0.98      0.99      0.98       134
           1       0.92      0.87      0.89        16

    accuracy                           0.97       150
   macro avg       0.95      0.93      0.94       150
weighted avg       0.97      0.97      0.97       150
```
Common Pitfalls
- Applying SMOTE before splitting data: This causes data leakage and overly optimistic results. Always split first, then apply SMOTE only on training data.
- Using SMOTE on test data: Never apply SMOTE on test or validation sets; it should only be used on training data.
- Ignoring class imbalance in evaluation: Check metrics like recall and F1-score, not just accuracy, because accuracy can be misleading on imbalanced data.
```python
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Wrong: applying SMOTE before the split causes data leakage
# X_res, y_res = SMOTE().fit_resample(X, y)
# X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.3)

# Right: split first, then apply SMOTE only on the training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
X_train_res, y_train_res = SMOTE().fit_resample(X_train, y_train)
```
Quick Reference
SMOTE Quick Tips:
- Import from imblearn.over_sampling.
- Use fit_resample(X_train, y_train) after splitting data.
- Set sampling_strategy to control minority class size.
- Use random_state for reproducibility.
- Evaluate with recall, precision, and F1-score.
Key Takeaways
- Always apply SMOTE only on training data after splitting to avoid data leakage.
- Use SMOTE's fit_resample method to generate synthetic minority class samples.
- Check balanced metrics like recall and F1-score, not just accuracy, on imbalanced data.
- Set random_state in SMOTE for reproducible results.
- SMOTE can improve model performance on minority classes by balancing the class distribution.