Bird
Raised Fist0
NLPml~20 mins

Handling imbalanced text data in NLP - ML Experiment: Train & Evaluate

Choose your learning style10 modes available

Start learning this pattern below

Jump into concepts and practice - no test required

or
Recommended
Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong
Experiment - Handling imbalanced text data
Problem:We want to classify text messages into two categories: spam and not spam. The dataset has 90% not spam and 10% spam messages. The current model learns well on training data but performs poorly on spam detection in validation.
Current Metrics:Training accuracy: 95%, Validation accuracy: 88%, Validation recall for spam: 50%
Issue:The model is biased towards the majority class (not spam). It misses many spam messages, showing poor recall on the minority class.
Your Task
Improve the model to detect spam better by increasing validation recall for spam to at least 75%, while keeping overall validation accuracy above 85%.
You can only modify data preprocessing and model training steps.
Do not change the model architecture drastically.
Keep training time reasonable (under 5 minutes).
Hint 1
Hint 2
Hint 3
Solution
NLP
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils.class_weight import compute_class_weight
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

# Sample data (replace with real dataset)
texts = ["Free money now", "Hi, how are you?", "Win a prize", "Let's meet tomorrow", "Cheap meds available", "Are you coming?", "Congratulations, you won!", "Call me later"] * 100
labels = [1 if i % 10 == 0 else 0 for i in range(len(texts))]  # 1=spam, 0=not spam (~10% spam)

# Split data
X_train, X_val, y_train, y_val = train_test_split(texts, labels, test_size=0.2, stratify=labels, random_state=42)

# Vectorize text
vectorizer = TfidfVectorizer(max_features=1000)
X_train_vec = vectorizer.fit_transform(X_train).toarray()
X_val_vec = vectorizer.transform(X_val).toarray()

# Compute class weights
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
class_weight_dict = {i: w for i, w in enumerate(class_weights)}

# Build model
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train_vec.shape[1],)),
    Dropout(0.5),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])

# Early stopping
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Train with class weights
history = model.fit(X_train_vec, np.array(y_train), epochs=20, batch_size=32, validation_data=(X_val_vec, np.array(y_val)), class_weight=class_weight_dict, callbacks=[early_stop], verbose=0)

# Evaluate
val_preds = (model.predict(X_val_vec) > 0.5).astype(int).flatten()
from sklearn.metrics import accuracy_score, recall_score
val_accuracy = accuracy_score(y_val, val_preds) * 100
val_recall_spam = recall_score(y_val, val_preds, pos_label=1) * 100

print(f"Validation accuracy: {val_accuracy:.2f}%")
print(f"Validation recall for spam: {val_recall_spam:.2f}%")
Added class weights to the model training to give more importance to the minority spam class.
Included dropout layer to reduce overfitting.
Used stratified split to keep class distribution consistent.
Kept the model architecture simple but added dropout for regularization.
Results Interpretation

Before: Training accuracy 95%, Validation accuracy 88%, Spam recall 50%
After: Training accuracy 92%, Validation accuracy 89%, Spam recall 78%

Using class weights helps the model pay more attention to the minority class, improving recall for spam messages and reducing bias towards the majority class.
Bonus Experiment
Try oversampling the minority class using techniques like SMOTE or simple duplication and compare results with class weighting.
💡 Hint
Use imblearn's SMOTE to create synthetic spam samples before training and observe if recall improves further.

Practice

(1/5)
1. What is the main problem caused by imbalanced text data in machine learning models?
easy
A. The model may become biased towards the majority class
B. The model will always have perfect accuracy
C. The model will ignore all classes
D. The model will run faster

Solution

  1. Step 1: Understand class imbalance impact

    Imbalanced data means one class has many more examples than others, causing the model to favor that class.
  2. Step 2: Recognize bias effect

    This bias leads to poor performance on minority classes, reducing fairness and accuracy for those classes.
  3. Final Answer:

    The model may become biased towards the majority class -> Option A
  4. Quick Check:

    Imbalanced data causes bias = D [OK]
Hint: Imbalance means bias toward bigger class [OK]
Common Mistakes:
  • Thinking imbalance improves accuracy
  • Assuming model ignores all classes
  • Believing imbalance speeds up training
2. Which Python library function is commonly used to perform upsampling on imbalanced text data?
easy
A. numpy.dot
B. pandas.read_csv
C. sklearn.utils.resample
D. matplotlib.plot

Solution

  1. Step 1: Identify upsampling tool

    Upsampling means increasing minority class samples, and sklearn.utils.resample is designed for this.
  2. Step 2: Eliminate unrelated functions

    pandas.read_csv loads data, numpy.dot does matrix multiplication, matplotlib.plot draws graphs, so they don't upsample.
  3. Final Answer:

    sklearn.utils.resample -> Option C
  4. Quick Check:

    Upsampling uses sklearn.utils.resample = A [OK]
Hint: Upsample with sklearn.utils.resample [OK]
Common Mistakes:
  • Confusing data loading with upsampling
  • Using plotting or math functions for sampling
  • Not knowing sklearn utilities
3. Given this Python code snippet for downsampling the majority class in text data, what will be the length of downsampled_majority?
from sklearn.utils import resample
majority = ['a'] * 1000
minority = ['b'] * 100

downsampled_majority = resample(majority, replace=False, n_samples=len(minority), random_state=42)
print(len(downsampled_majority))
medium
A. 1000
B. 42
C. 1100
D. 100

Solution

  1. Step 1: Understand resample parameters

    resample is called with n_samples equal to length of minority (100), so it will pick 100 samples from majority.
  2. Step 2: Check replace and output length

    replace=False means no duplicates, so output length equals n_samples, which is 100.
  3. Final Answer:

    100 -> Option D
  4. Quick Check:

    Downsampled length = minority size = 100 [OK]
Hint: Downsample size matches minority length [OK]
Common Mistakes:
  • Assuming output length equals original majority size
  • Confusing random_state with sample size
  • Ignoring n_samples parameter
4. Identify the error in this code snippet that tries to balance imbalanced text data by upsampling minority class:
from sklearn.utils import resample
minority = ['text1', 'text2']
upsampled_minority = resample(minority, replace=True, n_samples=5)
print(len(upsampled_minority))
medium
A. No error; code runs correctly and prints 5
B. Missing random_state parameter causes error
C. replace=True is invalid for resample
D. n_samples must be less than original minority size

Solution

  1. Step 1: Check resample parameters

    replace=True allows sampling with replacement, so n_samples can be larger than original minority size.
  2. Step 2: Verify code behavior

    random_state is optional; code runs fine and prints length 5 as expected.
  3. Final Answer:

    No error; code runs correctly and prints 5 -> Option A
  4. Quick Check:

    Upsampling with replacement works = A [OK]
Hint: replace=True allows larger sample size [OK]
Common Mistakes:
  • Thinking random_state is mandatory
  • Believing n_samples must be smaller
  • Confusing replace parameter usage
5. You have a text classification dataset with 90% class A and 10% class B. After upsampling class B to balance the data, which metric should you check to ensure your model performs well on both classes?
hard
A. Accuracy only
B. Precision and recall for each class
C. Training time
D. Number of epochs

Solution

  1. Step 1: Understand metric importance

    Accuracy can be misleading with imbalanced data; precision and recall show performance per class.
  2. Step 2: Choose metrics for balanced evaluation

    Precision and recall help check if model correctly identifies minority class without many false positives or negatives.
  3. Final Answer:

    Precision and recall for each class -> Option B
  4. Quick Check:

    Balanced data needs precision & recall check = C [OK]
Hint: Check precision and recall, not just accuracy [OK]
Common Mistakes:
  • Relying only on accuracy
  • Ignoring class-wise metrics
  • Focusing on training time or epochs