
Handling imbalanced text data in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual (intermediate)
Why use SMOTE for imbalanced text data?
You have a text classification task with very few examples of one class. Why might you use SMOTE (Synthetic Minority Over-sampling Technique) on the text features?
A) To create synthetic examples of the minority class by interpolating feature vectors, helping balance the dataset.
B) To remove noisy examples from the majority class to reduce imbalance.
C) To convert text data into numerical vectors using TF-IDF.
D) To randomly duplicate minority class examples without changing their features.
💡 Hint: Think about how SMOTE creates new data points rather than just copying existing ones.
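As a rough illustration of the interpolation idea behind SMOTE, here is a minimal NumPy-only sketch. The `minority` matrix and `smote_like_sample` helper are hypothetical stand-ins, not the imblearn implementation (real SMOTE interpolates between a point and one of its k nearest minority neighbours, not an arbitrary pair):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical dense minority-class feature vectors (e.g. TF-IDF rows).
minority = np.array([[1.0, 0.0, 0.5],
                     [0.8, 0.2, 0.4],
                     [0.9, 0.1, 0.6]])

def smote_like_sample(X, n_new, rng):
    """Create n_new synthetic rows, each interpolated between two
    randomly chosen minority-class rows."""
    synthetic = []
    for _ in range(n_new):
        i, j = rng.choice(len(X), size=2, replace=False)
        lam = rng.random()  # interpolation factor in [0, 1]
        synthetic.append(X[i] + lam * (X[j] - X[i]))
    return np.array(synthetic)

new_points = smote_like_sample(minority, n_new=5, rng=rng)
print(new_points.shape)  # (5, 3)
```

Each synthetic row lies on the line segment between two existing minority rows, so it resembles real minority data instead of being an exact duplicate.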
Predict Output (intermediate)
Output of class distribution after random oversampling
Given the following code that uses RandomOverSampler on text data features, what will be the printed class distribution?
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from imblearn.over_sampling import RandomOverSampler

texts = ['good', 'bad', 'good', 'bad', 'bad', 'good', 'good', 'bad', 'bad', 'bad']
labels = [1, 0, 1, 0, 0, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, labels)

print(Counter(y_res))
A) Counter({0: 6, 1: 6})
B) Counter({0: 5, 1: 5})
C) Counter({0: 7, 1: 3})
D) Counter({0: 6, 1: 4})
💡 Hint: RandomOverSampler balances classes by duplicating minority class samples until counts match.
Model Choice (advanced)
Best model choice for imbalanced text classification
You have a highly imbalanced text dataset with 95% negative and 5% positive labels. Which model choice is best to handle this imbalance?
A) A K-Nearest Neighbors model with k=3 and no class weighting.
B) A simple neural network without any class weighting or sampling.
C) A decision tree with default parameters and no imbalance handling.
D) A logistic regression model with the class_weight='balanced' parameter.
💡 Hint: Consider models that can adjust their learning to pay more attention to the minority class.
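To see the effect of class weighting concretely, here is a small sketch using a synthetic 95/5 dataset from sklearn's `make_classification` as a stand-in for vectorised text (the dataset and numbers are illustrative, not from the quiz):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic 95/5 imbalanced dataset standing in for vectorised text.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(class_weight='balanced',
                              max_iter=1000).fit(X_tr, y_tr)

# class_weight='balanced' upweights errors on the rare class,
# which typically raises recall on the positive (minority) label.
print("plain recall:   ", recall_score(y_te, plain.predict(X_te)))
print("weighted recall:", recall_score(y_te, weighted.predict(X_te)))
```

With `class_weight='balanced'`, each sample's loss is scaled inversely to its class frequency, so the model cannot minimise loss simply by predicting the majority class.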
Hyperparameter (advanced)
Choosing the right threshold for imbalanced text classification
After training a binary text classifier on imbalanced data, you notice low recall for the minority class. Which hyperparameter adjustment can help improve recall?
A) Increase the learning rate to speed up training.
B) Lower the classification threshold below 0.5 to predict more positives.
C) Increase the batch size to stabilize gradients.
D) Use early stopping to prevent overfitting.
💡 Hint: Recall improves when the model predicts more positive cases, even if some are false positives.
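Threshold tuning can be sketched with `predict_proba` on a synthetic dataset (again a stand-in for vectorised text; the 0.3 cutoff is an arbitrary illustrative choice):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data standing in for vectorised text.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]  # P(positive class) per sample

preds_default = (proba >= 0.5).astype(int)
preds_lowered = (proba >= 0.3).astype(int)  # more lenient cutoff

# Lowering the threshold can only add positive predictions,
# which raises recall (at some cost in precision).
print("positives at 0.5:", preds_default.sum())
print("positives at 0.3:", preds_lowered.sum())
```

Note the cutoff is a decision-time setting, so it can be tuned after training (e.g. on a validation set) without refitting the model.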
Metrics (expert)
Choosing the best metric for imbalanced text data evaluation
You trained a text classifier on imbalanced data. Which metric is best to evaluate model performance focusing on minority class detection?
A) Accuracy, because it shows overall correct predictions.
B) Log Loss, because it measures probability calibration.
C) F1-score, because it balances precision and recall for the minority class.
D) Mean Squared Error, because it measures prediction error.
💡 Hint: Accuracy can be misleading when classes are imbalanced.
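The accuracy trap is easy to demonstrate with a degenerate classifier that always predicts the majority class (labels below are made up for illustration):

```python
from sklearn.metrics import accuracy_score, f1_score

# 95 negatives, 5 positives; predictor always outputs the majority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))             # 0.95 - looks great
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0  - reveals the failure
```

Accuracy rewards the model for ignoring the minority class entirely, while F1 drops to zero because the classifier finds no true positives.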