How to Handle an Imbalanced Dataset in Python with sklearn
To handle an imbalanced dataset in Python, use techniques like SMOTE to oversample the minority class, or set class_weight='balanced' in sklearn models to adjust for class imbalance. These methods help your model learn from all classes and improve recall and f1-score on the minority class.

Why This Happens
Imbalanced datasets occur when one class has many more samples than another. This causes models to favor the majority class, leading to poor predictions on the minority class.
For example, if you train a model on data where 95% are class A and 5% are class B, the model might just predict class A all the time to get high accuracy, ignoring class B.
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Create an imbalanced dataset: ~95% class 0, ~5% class 1
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.95, 0.05],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=5, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(classification_report(y_test, preds))
```
The Fix
To fix the imbalance, use SMOTE to create synthetic samples for the minority class, or set class_weight='balanced' in your model so that errors on the minority class are penalized more heavily during training.
Either approach helps the model learn from both classes and improves recall and f1-score for the minority class.
```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Create an imbalanced dataset: ~95% class 0, ~5% class 1
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.95, 0.05],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=5, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Apply SMOTE to the training data only, after the split
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)

model = LogisticRegression()
model.fit(X_train_bal, y_train_bal)
preds = model.predict(X_test)
print(classification_report(y_test, preds))
```
Prevention
To avoid imbalance issues, always check your dataset's class distribution before training. Use Counter from collections to count classes.
Apply resampling techniques like SMOTE, RandomOverSampler, or RandomUnderSampler as needed. Alternatively, set class_weight='balanced' in sklearn models to have the weights adjusted automatically, with no resampling step at all.
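The class_weight option is the lightest-weight fix, since it needs no extra library. A minimal sketch, reusing the same synthetic dataset as above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = make_classification(n_classes=2, class_sep=2, weights=[0.95, 0.05],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=5, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# class_weight='balanced' weights each class inversely to its frequency,
# so mistakes on the rare class cost more during fitting
model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```

This often trades a little majority-class precision for much better minority-class recall, which is usually the right trade on imbalanced data.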
Regularly evaluate models with metrics like recall, precision, and f1-score, not just accuracy, to catch imbalance problems early.
```python
from collections import Counter

# Check class distribution before training
print(Counter(y))
```
Related Errors
Common related issues include:
- Overfitting minority class: Oversampling without care can cause the model to memorize minority samples.
- Ignoring minority class: Using accuracy alone hides poor minority class performance.
- Data leakage: Applying resampling before splitting data can leak test info into training.
Fix these by splitting into train and test sets before resampling, monitoring multiple metrics, and tuning model complexity.