How to Fix Class Imbalance Problem in Python with sklearn
You can use techniques like SMOTE (from the imbalanced-learn library) to oversample the minority class, or set class_weight='balanced' in sklearn classifiers to give more weight to the minority class. These methods help models learn from imbalanced data and improve minority-class predictions.

Why This Happens
Class imbalance happens when one class has many more examples than another in your dataset. This causes models to ignore the smaller class and predict mostly the majority class, leading to poor results on the minority class.
Here is an example where a model is trained on imbalanced data without any fix:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Create an imbalanced dataset (90% class 0, 10% class 1)
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.9, 0.1],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=5, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(classification_report(y_test, preds))
The Fix
To fix class imbalance, oversample with SMOTE to create synthetic minority samples, or set class_weight='balanced' in your model so misclassifying minority examples costs more during training. Either approach helps the model learn the minority class and improves its predictions.
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import make_classification

# Create an imbalanced dataset (90% class 0, 10% class 1)
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.9, 0.1],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=5, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Apply SMOTE oversampling to the training set only
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

model = LogisticRegression(class_weight='balanced')
model.fit(X_train_res, y_train_res)
preds = model.predict(X_test)
print(classification_report(y_test, preds))
Prevention
To avoid class imbalance problems, always check your dataset's class distribution before training. Use techniques like oversampling, undersampling, or class weights early in your workflow. Also, consider collecting more data for minority classes if possible.
Best practices include:
- Use collections.Counter or np.bincount to check class counts.
- Apply resampling methods like SMOTE or RandomUnderSampler.
- Set class_weight='balanced' in sklearn classifiers.
- Evaluate models with metrics sensitive to imbalance, such as recall, precision, and F1-score.
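As a quick illustration of the first check, here is a minimal sketch (using a hypothetical label array y) showing both ways to inspect class counts before training:

```python
import numpy as np
from collections import Counter

# Hypothetical labels: 90 samples of class 0, 10 of class 1
y = np.array([0] * 90 + [1] * 10)

# Two equivalent ways to check the class distribution
print(Counter(y))      # Counter({0: 90, 1: 10})
print(np.bincount(y))  # [90 10]
```

If one class dominates like this, plan for resampling or class weights before fitting any model.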
Related Errors
Common related issues include:
- Overfitting to majority class: Model predicts only the majority class, ignoring minority.
- Misleading accuracy: High accuracy but poor minority class detection.
- Data leakage during resampling: Oversampling before splitting data causes over-optimistic results.
Fixes involve using balanced metrics, proper train-test splitting before resampling, and applying class weights or resampling carefully.