Challenge - 5 Problems

Text Classification Master

❓ Model Choice

intermediate

Choosing the best model for short text classification

You have a dataset of short customer reviews labeled as positive or negative. Which model is most suitable to start with for this text classification task?

A. A simple Logistic Regression model with TF-IDF features

B. A deep Convolutional Neural Network with raw text input

C. A K-Means clustering model to group reviews

D. A Principal Component Analysis (PCA) model to reduce text dimensions
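
For readers unsure what "TF-IDF features" refers to, here is a minimal sketch using the textbook weighting tf * log(N / df). The mini-corpus and the helper name `tfidf` are hypothetical and only for illustration. scikit-learn's TfidfVectorizer applies a smoothed IDF and L2 normalization by default, so its numbers would differ from these.

```python
import math
from collections import Counter

# Hypothetical mini-corpus of short reviews (not from the quiz dataset).
reviews = [
    "great product great price",
    "terrible product",
    "great service",
]

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

vectors = tfidf(reviews)
for review, vec in zip(reviews, vectors):
    print(review, "->", {t: round(w, 3) for t, w in vec.items()})
```

Terms that appear in few documents (such as "terrible") receive a higher weight than terms shared across documents (such as "product").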

❓ Predict Output

intermediate

Output of text preprocessing code

What is the output of the following Python code that preprocesses text for classification?

import re
text = "Hello World! This is AI-2024."
processed = re.sub(r'[^a-zA-Z ]', '', text).lower().split()
print(processed)

A. ['hello', 'world', 'this', 'is', 'ai']

B. ['Hello', 'World', 'This', 'is', 'AI']

C. ['hello', 'world', 'this', 'is', 'ai2024']

D. ['Hello', 'World', 'This', 'is', 'AI2024']
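
To experiment with this style of preprocessing on your own input, the same three steps (regex substitution, lowercasing, whitespace split) can be wrapped in a small function. This is a sketch only; the function name `simple_tokens` and the sample sentence are hypothetical and deliberately differ from the question's input.

```python
import re

def simple_tokens(text):
    """Keep only ASCII letters and spaces, lowercase, then split on whitespace."""
    letters_only = re.sub(r'[^a-zA-Z ]', '', text)
    return letters_only.lower().split()

# Hypothetical sample sentence, not the one used in the question.
tokens = simple_tokens("Top-10 picks, rated 5/5!")
print(tokens)
```

Because split() with no arguments collapses runs of whitespace, any extra spaces left behind by the substitution do not produce empty tokens.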

❓ Hyperparameter

advanced

Selecting the best hyperparameter for TF-IDF vectorizer

You want to improve your text classification model by tuning the TF-IDF vectorizer. Which hyperparameter sets the maximum number of features (vocabulary terms) to keep?

A. min_df

B. max_features

C. ngram_range

D. stop_words

❓ Metrics

advanced

Interpreting classification report metrics

Your text classification model outputs the following metrics: precision=0.8, recall=0.5, accuracy=0.75. What does the low recall indicate?

A. The model has perfect predictions

B. The model predicts too many false positives

C. The model has balanced errors between classes

D. The model misses many positive examples (false negatives are high)
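
As a worked example of how these three metrics are computed, the sketch below uses one hypothetical set of confusion-matrix counts that happens to reproduce the quoted values. The counts are an assumption for illustration; the question does not give them, and other counts could yield the same metrics.

```python
# Hypothetical confusion-matrix counts chosen to match the quoted metrics.
tp, fp, fn, tn = 40, 10, 40, 110

precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found
accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all predictions that are correct

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```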

🔧 Debug

expert

Debugging a text classification pipeline error

You run this code snippet to train a text classification model but get "ValueError: Found input variables with inconsistent numbers of samples". What is the cause?

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
texts = ['good', 'bad', 'average']
labels = [1, 0]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression()
model.fit(X, labels)

A. LogisticRegression requires labels to be strings

B. CountVectorizer cannot process single words

C. The labels list length does not match the number of text samples

D. fit_transform returns a dense matrix, but model expects sparse
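
When working with pipelines like this, a small sanity check before fitting can surface sample-count problems with a clearer message. The helper `check_same_length` and its inputs are hypothetical; this is a sketch of one possible guard, not part of scikit-learn.

```python
def check_same_length(samples, targets):
    """Raise early if inputs and labels disagree on the number of samples."""
    if len(samples) != len(targets):
        raise ValueError(
            f"Got {len(samples)} samples but {len(targets)} labels"
        )

# Hypothetical inputs, unrelated to the snippet in the question.
docs = ["fast delivery", "broken on arrival", "works fine"]
targets = [1, 0, 1]
check_same_length(docs, targets)  # passes silently

try:
    check_same_length(docs, targets[:2])
except ValueError as err:
    message = str(err)
    print(message)
```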
