Challenge - 5 Problems
Text Classification Master
❓ Model Choice
intermediate
Choosing the best model for short text classification
You have a dataset of short customer reviews labeled as positive or negative. Which model is most suitable to start with for this text classification task?
💡 Hint
Think about models that work well with labeled data and simple features for text.
Explanation
Logistic Regression on TF-IDF features is a strong baseline for text classification, especially for short texts. CNNs generally need more labeled data and training effort to outperform it. K-Means is an unsupervised clustering method, and PCA performs dimensionality reduction; neither is a classifier.
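As a rough illustration of that baseline, the sketch below chains scikit-learn's TfidfVectorizer and LogisticRegression in a pipeline. The six reviews and their labels are made up for the example, and default hyperparameters are assumed; a real project would also hold out a test set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = positive, 0 = negative.
reviews = [
    "great product, works well",
    "love it, excellent quality",
    "very happy with this purchase",
    "terrible, broke after a day",
    "waste of money, very poor",
    "awful quality, do not buy",
]
labels = [1, 1, 1, 0, 0, 0]

# TF-IDF turns each review into a sparse weighted bag-of-words vector;
# logistic regression then learns a linear decision boundary on top of it.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(reviews, labels)

new_reviews = ["excellent, works great", "poor quality, broke quickly"]
predictions = baseline.predict(new_reviews)
print(predictions)
```

With so few training examples the predictions are not meaningful in themselves; the point is the shape of the pipeline, which can later be swapped for a heavier model if the baseline falls short.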
❓ Predict Output
intermediate
Output of text preprocessing code
What is the output of the following Python code that preprocesses text for classification?
ML Python
import re

text = "Hello World! This is AI-2024."
processed = re.sub(r'[^a-zA-Z ]', '', text).lower().split()
print(processed)
💡 Hint
Look at how non-letter characters are removed and text is lowered.
Explanation
The regex removes every character that is not a letter or a space, so 'World!' becomes 'World' and 'AI-2024.' becomes 'AI'. The text is then lowercased and split on whitespace, giving ['hello', 'world', 'this', 'is', 'ai'].
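The same steps can be wrapped in a small helper to make each stage visible. The function name preprocess and the second input string are illustrative choices, not part of the original question.

```python
import re

def preprocess(text):
    """Keep only ASCII letters and spaces, lowercase, split on whitespace."""
    letters_only = re.sub(r"[^a-zA-Z ]", "", text)
    return letters_only.lower().split()

processed = preprocess("Hello World! This is AI-2024.")
print(processed)  # ['hello', 'world', 'this', 'is', 'ai']

# Caveat: deleting hyphens fuses the parts of a hyphenated word.
fused = preprocess("state-of-the-art model")
print(fused)  # ['stateoftheart', 'model']
```

If fused tokens are a problem, one option is to replace the unwanted characters with a space instead of an empty string before splitting.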
❓ Hyperparameter
advanced
Selecting the best hyperparameter for TF-IDF vectorizer
You want to improve your text classification model by tuning the TF-IDF vectorizer. Which hyperparameter controls the maximum number of features (words) to keep?
💡 Hint
This parameter limits the vocabulary size.
Explanation
max_features caps the vocabulary size, keeping only the terms with the highest term frequency across the corpus. min_df sets the minimum document frequency a term needs to be kept, ngram_range sets the range of n-gram sizes to extract, and stop_words removes common words.
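A small sketch of the effect of max_features, using scikit-learn's TfidfVectorizer on a made-up three-document corpus. The corpus is an assumption chosen so that the three most frequent terms are unambiguous.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus. Term counts across all documents:
# good=3, product=2, bad=2, price=1, service=1, delivery=1
corpus = [
    "good product good price",
    "bad product",
    "good service bad delivery",
]

# Without a cap, every distinct term becomes a feature.
full = TfidfVectorizer()
full.fit(corpus)

# With max_features=3, only the 3 most frequent terms are kept.
capped = TfidfVectorizer(max_features=3)
X = capped.fit_transform(corpus)

print(len(full.vocabulary_))        # number of distinct terms without a cap
print(sorted(capped.vocabulary_))   # terms that survived the cap
print(X.shape)                      # (documents, features)
```

A smaller vocabulary makes the model faster and less prone to fitting rare words, at the cost of discarding terms that might have been informative.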
❓ Metrics
advanced
Interpreting classification report metrics
Your text classification model outputs the following metrics: precision=0.8, recall=0.5, accuracy=0.75. What does the low recall indicate?
💡 Hint
Recall measures how many actual positives are found.
Explanation
A recall of 0.5 means the model finds only half of the actual positive cases; the other half are false negatives. Precision reflects false positives (here, 80% of predicted positives are correct), and accuracy is the overall fraction of correct predictions.
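To make the three numbers concrete, the sketch below computes them from a confusion matrix. The counts are hypothetical, picked only because they reproduce precision=0.8, recall=0.5 and accuracy=0.75; the question does not state the underlying counts.

```python
# Hypothetical confusion-matrix counts chosen to match the metrics above.
tp, fn = 4, 4    # actual positives: found vs. missed
fp, tn = 1, 11   # actual negatives: wrongly flagged vs. correctly rejected

precision = tp / (tp + fp)                  # of predicted positives, how many are right
recall = tp / (tp + fn)                     # of actual positives, how many are found
accuracy = (tp + tn) / (tp + fn + fp + tn)  # of all samples, how many are right

print(precision, recall, accuracy)  # 0.8 0.5 0.75
```

In this example four of the eight actual positives are missed, yet accuracy still looks reasonable because the negatives dominate the dataset.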
🔧 Debug
expert
Debugging a text classification pipeline error
You run this code snippet for training a text classification model but get a ValueError: Found input variables with inconsistent numbers of samples. What is the cause?
ML Python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ['good', 'bad', 'average']
labels = [1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

model = LogisticRegression()
model.fit(X, labels)
💡 Hint
Check if the number of labels matches the number of texts.
Explanation
The error occurs because texts has 3 samples but labels has only 2. scikit-learn requires the feature matrix and the label list to contain the same number of samples.
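One possible fix is sketched below: supply one label per text and check the lengths before fitting. The third label (1 for 'average') is an assumption made only so the example runs; the original snippet does not say what it should be.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ['good', 'bad', 'average']
labels = [1, 0, 1]  # assumed label for 'average'; the original list had only two entries

# Fail early with a clear message if the inputs are misaligned.
if len(texts) != len(labels):
    raise ValueError(f"{len(texts)} texts but {len(labels)} labels")

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

model = LogisticRegression()
model.fit(X, labels)
print(X.shape[0], len(labels))  # 3 3
```

The explicit length check is optional, since scikit-learn raises its own error, but it reports the mismatch before any vectorization work is done.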