Text classification pipeline in ML Python - Practice Problems & Coding Challenges

Challenge - 5 Problems
Model Choice
intermediate
Choosing the best model for short text classification
You have a dataset of short customer reviews labeled as positive or negative. Which model is most suitable to start with for this text classification task?
A. A simple Logistic Regression model with TF-IDF features
B. A deep Convolutional Neural Network with raw text input
C. A K-Means clustering model to group reviews
D. A Principal Component Analysis (PCA) model to reduce text dimensions
💡 Hint
Think about models that work well with labeled data and simple features for text.
Predict Output
intermediate
Output of text preprocessing code
What is the output of the following Python code that preprocesses text for classification?
ML Python
import re
text = "Hello World! This is AI-2024."
processed = re.sub(r'[^a-zA-Z ]', '', text).lower().split()
print(processed)
A. ['hello', 'world', 'this', 'is', 'ai']
B. ['Hello', 'World', 'This', 'is', 'AI']
C. ['hello', 'world', 'this', 'is', 'ai2024']
D. ['Hello', 'World', 'This', 'is', 'AI2024']
💡 Hint
Look at how non-letter characters are removed and how the text is lowercased.
Hyperparameter
advanced
Selecting the best hyperparameter for TF-IDF vectorizer
You want to improve your text classification model by tuning the TF-IDF vectorizer. Which hyperparameter controls the maximum number of features (terms) to keep?
A. min_df
B. max_features
C. ngram_range
D. stop_words
💡 Hint
This parameter limits the vocabulary size.
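When tuning the vectorizer, it can help to check how large the learned vocabulary is before deciding whether to limit it. The following is a minimal sketch, assuming scikit-learn is installed; the three example documents are made up for illustration and are not part of the challenge.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, invented for illustration only.
docs = [
    "the battery lasts long",
    "the screen is too dim",
    "battery died fast",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# vocabulary_ maps each learned term to its column index in X,
# so its length equals the number of feature columns.
vocab_size = len(vectorizer.vocabulary_)
print(vocab_size, X.shape)
```

The number of columns in X always equals the vocabulary size, so this is a quick way to see how many features the classifier will receive.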
Metrics
advanced
Interpreting classification report metrics
Your text classification model outputs the following metrics: precision=0.8, recall=0.5, accuracy=0.75. What does the low recall indicate?
A. The model has perfect predictions
B. The model predicts too many false positives
C. The model has balanced errors between classes
D. The model misses many positive examples (false negatives are high)
💡 Hint
Recall measures how many actual positives are found.
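The three metrics in the question can be worked through from confusion-matrix counts. The sketch below uses hypothetical counts, chosen only because they reproduce the figures in the question; the real counts behind those figures are not given.

```python
# Hypothetical confusion-matrix counts (an assumption, not from the question):
# true positives, false positives, false negatives, true negatives.
tp, fp, fn, tn = 4, 1, 4, 11

# Precision: of everything predicted positive, the share that is truly positive.
precision = tp / (tp + fp)

# Recall: of everything truly positive, the share the model found.
recall = tp / (tp + fn)

# Accuracy: the share of all predictions that are correct.
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(precision, recall, accuracy)
```

Changing fn while holding the other counts fixed shows which metric responds to missed positives.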
🔧 Debug
expert
Debugging a text classification pipeline error
You run this code snippet for training a text classification model but get a ValueError: Found input variables with inconsistent numbers of samples. What is the cause?
ML Python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
texts = ['good', 'bad', 'average']
labels = [1, 0]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression()
model.fit(X, labels)
A. LogisticRegression requires labels to be strings
B. CountVectorizer cannot process single words
C. The labels list length does not match the number of text samples
D. fit_transform returns a dense matrix, but model expects sparse
💡 Hint
Check if the number of labels matches the number of texts.
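The check suggested by the hint can be sketched as a small guard that runs before fitting. The helper name check_same_length is hypothetical and is not part of scikit-learn; this is one possible way to surface the problem early, using the same texts and labels as the snippet in the question.

```python
def check_same_length(texts, labels):
    """Raise a descriptive error if texts and labels differ in length."""
    if len(texts) != len(labels):
        raise ValueError(f"{len(texts)} texts but {len(labels)} labels")


texts = ['good', 'bad', 'average']
labels = [1, 0]

# Run the guard and capture the message instead of letting the script crash.
try:
    check_same_length(texts, labels)
    mismatch_message = None
except ValueError as err:
    mismatch_message = str(err)

print(mismatch_message)
```

Running a guard like this before model.fit reports the two counts side by side, which is often easier to act on than the library's own error.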