Imbalanced text data means some categories have many examples while others have very few. Handling the imbalance helps models learn the minority classes and make fairer predictions.
Handling imbalanced text data in NLP
Introduction
Imbalanced classes appear in many common NLP tasks:
Classifying emails as spam or not spam, where spam emails are far fewer.
Detecting rare events in customer reviews, such as complaints among mostly normal feedback.
Sorting news articles into topics, where some topics appear much less often than others.
Building sentiment analysis models with many neutral texts but few positive or negative ones.
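Before choosing a balancing strategy, it helps to measure how imbalanced the labels actually are. A minimal sketch using only the standard library (the label counts here are made up for illustration):

```python
from collections import Counter

# Hypothetical labels for a spam classifier: 1 = spam, 0 = not spam
labels = [0] * 90 + [1] * 10

counts = Counter(labels)
total = len(labels)
for cls, count in sorted(counts.items()):
    print(f"class {cls}: {count} samples ({count / total:.0%})")

# Ratio between the largest and smallest class
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

A ratio well above 1 (here 9.0) signals that accuracy alone will be misleading and that resampling or class weighting is worth considering.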
Syntax
from sklearn.utils import resample

# Upsample minority class
minority_upsampled = resample(minority_class_data, replace=True,
                              n_samples=desired_count, random_state=42)

# Downsample majority class
majority_downsampled = resample(majority_class_data, replace=False,
                                n_samples=desired_count, random_state=42)
Use resample to increase (upsample) or decrease (downsample) the number of samples in a class.
Set random_state for reproducible results.
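As a quick runnable sketch of the syntax above (the toy string lists stand in for real text samples):

```python
from sklearn.utils import resample

# Toy data: 3 minority samples, 6 majority samples
minority_class_data = ["rare a", "rare b", "rare c"]
majority_class_data = ["common %d" % i for i in range(6)]

# Upsample the minority class to 6 samples (sampling with replacement)
minority_upsampled = resample(minority_class_data, replace=True,
                              n_samples=6, random_state=42)

# Downsample the majority class to 3 samples (sampling without replacement)
majority_downsampled = resample(majority_class_data, replace=False,
                                n_samples=3, random_state=42)

print(len(minority_upsampled))    # 6
print(len(majority_downsampled))  # 3
```

Note that replace=True is required whenever n_samples exceeds the class size, since samples must then repeat.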
Examples
This upsamples the smaller class to the size of the larger one by repeating samples with replacement.
from sklearn.utils import resample

# Upsample minority class to match majority
minority_upsampled = resample(minority_data, replace=True,
                              n_samples=len(majority_data), random_state=1)
This downsamples the larger class by randomly selecting a subset of its samples without replacement.
from sklearn.utils import resample

# Downsample majority class to match minority
majority_downsampled = resample(majority_data, replace=False,
                                n_samples=len(minority_data), random_state=1)
SMOTE (Synthetic Minority Over-sampling Technique) creates new synthetic samples for the minority class instead of duplicating existing ones.
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
Sample Model
This example balances a small imbalanced text dataset by repeating the minority-class samples, then trains a simple model and checks its performance.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Sample imbalanced text data: 3 positive vs 5 negative reviews
texts = ["good product", "bad product", "excellent", "poor quality",
         "terrible", "awesome", "bad", "awful"]
labels = [1, 0, 1, 0, 0, 1, 0, 0]  # 1 = positive, 0 = negative

# Separate majority and minority classes
texts_pos = [t for t, l in zip(texts, labels) if l == 1]  # minority
texts_neg = [t for t, l in zip(texts, labels) if l == 0]  # majority

# Upsample minority class (positive) to match majority (negative)
texts_pos_upsampled = resample(texts_pos, replace=True,
                               n_samples=len(texts_neg), random_state=42)
labels_pos_upsampled = [1] * len(texts_pos_upsampled)

# Combine balanced data
texts_balanced = texts_neg + texts_pos_upsampled
labels_balanced = [0] * len(texts_neg) + labels_pos_upsampled

# Vectorize text
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts_balanced)
y = labels_balanced

# Train logistic regression
model = LogisticRegression(random_state=42)
model.fit(X, y)

# Predict on training data
predictions = model.predict(X)

# Print classification report
print(classification_report(y, predictions, zero_division=0))
Important Notes
Upsampling can cause overfitting because the model sees repeated copies of the same data.
Downsampling may lose useful information by removing data.
Try synthetic methods like SMOTE for better balance without duplicates.
Summary
Imbalanced text data can hurt model learning and fairness.
Use upsampling, downsampling, or synthetic sampling to balance classes.
Check model performance carefully after balancing data.