from imblearn.over_sampling import [1]
from sklearn.feature_extraction.text import [2]

ros = [1](random_state=0)
tfidf = [2](stop_words='english')
X = tfidf.fit_transform(texts)
X_resampled, y_resampled = ros.fit_resample(X, y)

from imblearn.pipeline import Pipeline
from sklearn.linear_model import [1]
from imblearn.over_sampling import [2]
from sklearn.feature_extraction.text import [3]

pipeline = Pipeline([
    ('vectorizer', [3](stop_words='english')),
    ('oversample', [2](random_state=42)),
    ('classifier', [1]())
])

Practice

1. What is the main problem caused by imbalanced text data in machine learning models?

easy

A. The model may become biased towards the majority class

B. The model will always have perfect accuracy

C. The model will ignore all classes

D. The model will run faster

Solution

Step 1: Understand class imbalance impact
Imbalanced data means one class has many more examples than others, causing the model to favor that class.
Step 2: Recognize bias effect
This bias leads to poor performance on minority classes, reducing fairness and accuracy for those classes.
Final Answer:
The model may become biased towards the majority class -> Option A
Quick Check:
Imbalanced data causes bias = A [OK]

Hint: Imbalance means bias toward bigger class [OK]

Common Mistakes:

Thinking imbalance improves accuracy
Assuming model ignores all classes
Believing imbalance speeds up training
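The bias described in question 1 can be seen without training anything. The sketch below uses made-up labels ("ham" and "spam" are hypothetical class names, and the 90/10 split is an assumption chosen for illustration) and a baseline that always predicts the most frequent class.

```python
from collections import Counter

# Hypothetical labels: 90 majority-class and 10 minority-class examples.
labels = ["ham"] * 90 + ["spam"] * 10

# A baseline that always predicts the most frequent class.
majority_class = Counter(labels).most_common(1)[0][0]
predictions = [majority_class] * len(labels)

# Overall accuracy looks good because most examples belong to the majority.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on the minority class: share of true "spam" predicted as "spam".
spam_total = sum(y == "spam" for y in labels)
spam_found = sum(p == "spam" and y == "spam" for p, y in zip(predictions, labels))
minority_recall = spam_found / spam_total

print(accuracy)         # 0.9
print(minority_recall)  # 0.0
```

The baseline scores 90% accuracy while finding none of the minority examples, which is why a high accuracy figure alone says little on imbalanced data.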

2. Which Python library function is commonly used to perform upsampling on imbalanced text data?

easy

A. numpy.dot

B. pandas.read_csv

C. sklearn.utils.resample

D. matplotlib.plot

Solution

Step 1: Identify upsampling tool
Upsampling means increasing the number of minority-class samples. sklearn.utils.resample can do this by drawing from the minority class with replacement (replace=True).
Step 2: Eliminate unrelated functions
pandas.read_csv loads data, numpy.dot does matrix multiplication, matplotlib.plot draws graphs, so they don't upsample.
Final Answer:
sklearn.utils.resample -> Option C
Quick Check:
Upsampling uses sklearn.utils.resample = C [OK]

Hint: Upsample with sklearn.utils.resample [OK]

Common Mistakes:

Confusing data loading with upsampling
Using plotting or math functions for sampling
Not knowing sklearn utilities
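As a minimal sketch of the upsampling described in question 2, the example below grows a small minority class to the size of the majority class. The texts, variable names, and 0/1 labels are hypothetical; only sklearn.utils.resample and its replace, n_samples, and random_state parameters come from the material above.

```python
from sklearn.utils import resample

# Hypothetical toy data: 6 majority texts and 2 minority texts.
majority_texts = [f"ordinary message {i}" for i in range(6)]
minority_texts = ["rare complaint one", "rare complaint two"]

# Draw with replacement until the minority class matches the majority size.
upsampled_minority = list(
    resample(
        minority_texts,
        replace=True,
        n_samples=len(majority_texts),
        random_state=0,
    )
)

# Combine both classes into one balanced dataset (0 = majority, 1 = minority).
balanced_texts = majority_texts + upsampled_minority
balanced_labels = [0] * len(majority_texts) + [1] * len(upsampled_minority)

print(len(upsampled_minority))                         # 6
print(set(upsampled_minority) <= set(minority_texts))  # True
```

Upsampling this way only repeats existing minority texts; it adds no new information, so the balanced set still contains just two distinct minority examples.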

3. Given this Python code snippet for downsampling the majority class in text data, what will be the length of downsampled_majority?

from sklearn.utils import resample
majority = ['a'] * 1000
minority = ['b'] * 100

downsampled_majority = resample(majority, replace=False, n_samples=len(minority), random_state=42)
print(len(downsampled_majority))

medium

A. 1000

B. 42

C. 1100

D. 100

Solution

Step 1: Understand resample parameters
resample is called with n_samples equal to the length of minority (100), so it draws 100 samples from majority.
Step 2: Check replace and output length
The output length always equals n_samples, which is 100. replace=False means no item is drawn twice, which works here because 100 is no larger than the 1000 items available.
Final Answer:
100 -> Option D
Quick Check:
Downsampled length = minority size = 100 [OK]

Hint: Downsample size matches minority length [OK]

Common Mistakes:

Assuming output length equals original majority size
Confusing random_state with sample size
Ignoring n_samples parameter
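The role of replace=False in question 3 is easier to see next to a case where it fails. The sketch below reuses the lists from the question; the oversize_failed flag is a hypothetical name added for illustration.

```python
from sklearn.utils import resample

majority = ["a"] * 1000
minority = ["b"] * 100

# Sampling without replacement: each majority item is drawn at most once.
downsampled = resample(
    majority, replace=False, n_samples=len(minority), random_state=42
)
print(len(downsampled))  # 100

# Asking for more items than exist cannot work without replacement,
# so resample raises ValueError.
try:
    resample(minority, replace=False, n_samples=len(majority), random_state=42)
    oversize_failed = False
except ValueError:
    oversize_failed = True

print(oversize_failed)  # True
```

Downsampling therefore needs n_samples to be at most the size of the class being reduced, while upsampling beyond the class size needs replace=True.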

4. Identify the error, if any, in this code snippet that tries to balance imbalanced text data by upsampling the minority class:

from sklearn.utils import resample
minority = ['text1', 'text2']
upsampled_minority = resample(minority, replace=True, n_samples=5)
print(len(upsampled_minority))

medium

A. No error; code runs correctly and prints 5

B. Missing random_state parameter causes error

C. replace=True is invalid for resample

D. n_samples must be less than original minority size

Solution

Step 1: Check resample parameters
replace=True allows sampling with replacement, so n_samples can be larger than original minority size.
Step 2: Verify code behavior
random_state is optional. Without it, the items drawn can differ between runs, but the code still runs and always prints a length of 5.
Final Answer:
No error; code runs correctly and prints 5 -> Option A
Quick Check:
Upsampling with replacement works = A [OK]

Hint: replace=True allows larger sample size [OK]

Common Mistakes:

Thinking random_state is mandatory
Believing n_samples must be smaller
Confusing replace parameter usage
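Although random_state is optional, setting it makes the draw repeatable. The short sketch below reuses the minority list from question 4; the seed value 0 is an arbitrary choice.

```python
from sklearn.utils import resample

minority = ["text1", "text2"]

# Same seed, same draw: useful when an experiment must be repeatable.
first = list(resample(minority, replace=True, n_samples=5, random_state=0))
second = list(resample(minority, replace=True, n_samples=5, random_state=0))

print(len(first))       # 5
print(first == second)  # True
```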

5. You have a text classification dataset with 90% class A and 10% class B. After upsampling class B to balance the data, which metric should you check to ensure your model performs well on both classes?

hard

A. Accuracy only

B. Precision and recall for each class

C. Training time

D. Number of epochs

Solution

Step 1: Understand metric importance
Accuracy can be misleading with imbalanced data; precision and recall show performance per class.
Step 2: Choose metrics for balanced evaluation
Precision and recall show whether the model correctly identifies the minority class without producing many false positives or false negatives.
Final Answer:
Precision and recall for each class -> Option B
Quick Check:
Balanced data needs precision & recall check = B [OK]

Hint: Check precision and recall, not just accuracy [OK]

Common Mistakes:

Relying only on accuracy
Ignoring class-wise metrics
Focusing on training time or epochs
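To make the per-class check in question 5 concrete, here is a small sketch that computes precision and recall for each class by hand. The precision_recall helper and the prediction counts are hypothetical, chosen only to match the 90% / 10% split in the question.

```python
def precision_recall(y_true, y_pred, label):
    """Return (precision, recall) for one class, treated as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical held-out set: 90 examples of class A, 10 of class B.
y_true = ["A"] * 90 + ["B"] * 10
# Hypothetical predictions: 5 A's mislabeled as B, 4 B's mislabeled as A.
y_pred = ["A"] * 85 + ["B"] * 5 + ["A"] * 4 + ["B"] * 6

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision_a, recall_a = precision_recall(y_true, y_pred, "A")
precision_b, recall_b = precision_recall(y_true, y_pred, "B")

print(accuracy)                                   # 0.91
print(round(precision_a, 2), round(recall_a, 2))  # 0.96 0.94
print(round(precision_b, 2), round(recall_b, 2))  # 0.55 0.6
```

Accuracy is 91%, yet class B reaches only about 55% precision and 60% recall, a gap that the single accuracy number hides.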
