
How to Handle an Imbalanced Dataset in Python with sklearn

To handle an imbalanced dataset in Python, use techniques like SMOTE to oversample the minority class, or set class_weight='balanced' in sklearn models to reweight the loss. These methods help your model learn from all classes instead of defaulting to the majority, improving recall and f1-score on the minority class.
🔍 Why This Happens

Imbalanced datasets occur when one class has many more samples than another. This causes models to favor the majority class, leading to poor predictions on the minority class.

For example, if you train a model on data where 95% are class A and 5% are class B, the model might just predict class A all the time to get high accuracy, ignoring class B.
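This accuracy trap is easy to reproduce. As a minimal sketch, sklearn's DummyClassifier with strategy='most_frequent' always predicts the majority class, learns nothing at all, and still scores high accuracy on a 95/5 split (dataset parameters copied from the example below):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score

# Same 95/5 imbalanced dataset as the example below
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.95, 0.05], n_informative=3,
                           n_redundant=1, flip_y=0, n_features=5,
                           n_clusters_per_class=1, n_samples=1000, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# A "model" that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
preds = baseline.predict(X_test)

print("Accuracy:", accuracy_score(y_test, preds))       # high, despite learning nothing
print("Minority recall:", recall_score(y_test, preds))  # 0.0: every class-1 sample missed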

python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Create imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.95, 0.05], n_informative=3,
                           n_redundant=1, flip_y=0, n_features=5,
                           n_clusters_per_class=1, n_samples=1000, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(classification_report(y_test, preds))
Output

              precision    recall  f1-score   support

           0       0.98      1.00      0.99       288
           1       1.00      0.20      0.33        12

    accuracy                           0.98       300
   macro avg       0.99      0.60      0.66       300
weighted avg       0.98      0.98      0.97       300
🔧 The Fix

To fix the imbalance, use SMOTE to create synthetic samples for the minority class, or set class_weight='balanced' in your model so that errors on the minority class are weighted more heavily.

This helps the model learn better from both classes and improves recall and f1-score for the minority class.

python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Create imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.95, 0.05], n_informative=3,
                           n_redundant=1, flip_y=0, n_features=5,
                           n_clusters_per_class=1, n_samples=1000, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Apply SMOTE to balance training data
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)

model = LogisticRegression()
model.fit(X_train_bal, y_train_bal)
preds = model.predict(X_test)
print(classification_report(y_test, preds))
Output

              precision    recall  f1-score   support

           0       0.98      0.97      0.98       288
           1       0.43      0.58      0.49        12

    accuracy                           0.95       300
   macro avg       0.70      0.78      0.73       300
weighted avg       0.96      0.95      0.95       300
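The other fix mentioned above, class_weight='balanced', needs no extra library: sklearn reweights each class inversely to its frequency, so minority-class errors cost more during training. A minimal sketch on the same generated dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Same imbalanced dataset as above
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.95, 0.05], n_informative=3,
                           n_redundant=1, flip_y=0, n_features=5,
                           n_clusters_per_class=1, n_samples=1000, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 'balanced' scales sample weights by n_samples / (n_classes * class_count),
# so the rare class contributes as much to the loss as the common one
model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(classification_report(y_test, preds))
```

Because it only changes the loss weighting, this approach avoids the overfitting risk of synthetic oversampling, at the cost of not adding any new minority examples.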
🛡️ Prevention

To avoid imbalance issues, always check your dataset's class distribution before training. Use Counter from collections to count classes.

Apply resampling techniques like SMOTE, RandomOverSampler, or RandomUnderSampler as needed. Also, consider using class_weight='balanced' in sklearn models to automatically adjust weights.

Regularly evaluate models with metrics like recall, precision, and f1-score, not just accuracy, to catch imbalance problems early.

python
from collections import Counter

# Check class distribution
print(Counter(y))
Output
Counter({0: 950, 1: 50})
⚠️ Related Errors

Common related issues include:

  • Overfitting minority class: Oversampling without care can cause the model to memorize minority samples.
  • Ignoring minority class: Using accuracy alone hides poor minority class performance.
  • Data leakage: Applying resampling before splitting data can leak test info into training.

Fix these by using proper train-test splits before resampling, monitoring multiple metrics, and tuning model complexity.

Key Takeaways

  • Always check class balance before training your model.
  • Use SMOTE or class_weight='balanced' to handle imbalanced datasets effectively.
  • Evaluate models with recall and f1-score, not just accuracy.
  • Apply resampling only on training data to avoid data leakage.
  • Monitor for overfitting when using oversampling techniques.