How to Fix Class Imbalance Problem in Python with sklearn
You can use techniques like SMOTE (from the imbalanced-learn library) to oversample the minority class, or set class_weight='balanced' in sklearn classifiers to give more weight to the minority class. These methods help models learn from imbalanced data and improve minority-class predictions.

Why This Happens
Class imbalance happens when one class has many more examples than another in your dataset. This causes models to ignore the smaller class and predict mostly the majority class, leading to poor results on the minority class.
Here is an example where a model is trained on imbalanced data without any fix:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Create an imbalanced dataset (90% class 0, 10% class 1)
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.9, 0.1],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=5, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(classification_report(y_test, preds))
The Fix
To fix class imbalance, oversample with SMOTE to create synthetic minority samples, or set class_weight='balanced' in your model so misclassifying minority examples costs more during training. Either approach helps the model learn the minority class and improves its predictions.
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import make_classification

# Create an imbalanced dataset (90% class 0, 10% class 1)
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.9, 0.1],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=5, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Apply SMOTE oversampling to the training set only
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

model = LogisticRegression(class_weight='balanced')
model.fit(X_train_res, y_train_res)
preds = model.predict(X_test)
print(classification_report(y_test, preds))
Prevention
To avoid class imbalance problems, always check your dataset's class distribution before training. Use techniques like oversampling, undersampling, or class weights early in your workflow. Also, consider collecting more data for minority classes if possible.
Best practices include:
- Use collections.Counter or np.bincount to check class counts.
- Apply resampling methods like SMOTE or RandomUnderSampler.
- Set class_weight='balanced' in sklearn classifiers.
- Evaluate models with metrics sensitive to imbalance, such as recall, precision, and F1-score.
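As a quick illustration of the first check, here is a minimal sketch (using a hypothetical label array y) showing both ways to inspect class counts before training:

```python
import numpy as np
from collections import Counter

# Hypothetical labels: 90 samples of class 0, 10 of class 1
y = np.array([0] * 90 + [1] * 10)

# Two equivalent ways to check the class distribution
print(Counter(y))      # Counter({0: 90, 1: 10})
print(np.bincount(y))  # [90 10]
```

If one class dominates like this, plan for resampling or class weights before fitting any model.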
Related Errors
Common related issues include:
- Overfitting to majority class: Model predicts only the majority class, ignoring minority.
- Misleading accuracy: High accuracy but poor minority class detection.
- Data leakage during resampling: Oversampling before splitting data causes over-optimistic results.
Fixes involve using balanced metrics, proper train-test splitting before resampling, and applying class weights or resampling carefully.