Challenge - 5 Problems
NLP Pipeline Master
❓ Predict Output
intermediate
What is the output of this tokenization code?
Given the following Python code using NLTK, what is printed for the
tokens variable?

    import nltk
    nltk.download('punkt')  # newer NLTK releases may require 'punkt_tab' instead
    from nltk.tokenize import word_tokenize

    text = "Hello world! This is a test."
    tokens = word_tokenize(text)
    print(tokens)
💡 Hint
Think about how word_tokenize splits punctuation as separate tokens.
Explanation
The word_tokenize function splits the text into words and punctuation marks, so '!' and '.' become their own tokens. The printed list is ['Hello', 'world', '!', 'This', 'is', 'a', 'test', '.'].
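To see the splitting behavior without downloading NLTK data, the sketch below approximates it with a standard-library regular expression. It is a stand-in under stated assumptions, not NLTK's actual algorithm (word_tokenize treats contractions, abbreviations, and quotes differently), and the helper name simple_tokenize is hypothetical. For this particular sentence it yields the same eight tokens.

```python
import re

def simple_tokenize(text):
    # Hypothetical helper, not part of NLTK: match either a run of word
    # characters, or a single character that is neither a word character
    # nor whitespace (i.e. a punctuation mark as its own token).
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Hello world! This is a test.")
print(tokens)
```

The point to notice is that the list has eight entries, not six: the two punctuation marks are counted as tokens alongside the six words.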
❓ Model Choice
intermediate
Which model is best for sentiment analysis in an NLP pipeline?
You want to build a simple NLP pipeline to classify movie reviews as positive or negative. Which model is most suitable?
💡 Hint
Sentiment analysis is a text classification task; choose a model designed for text understanding.
Explanation
BERT is a pretrained language model that can be fine-tuned for sentiment classification. Clustering models, regression models, and image CNNs are not designed for this kind of text classification task.
❓ Hyperparameter
advanced
Which hyperparameter affects the number of words considered in a Bag-of-Words model?
In a Bag-of-Words NLP pipeline using CountVectorizer, which hyperparameter controls the maximum number of words (features) to keep?
💡 Hint
This parameter limits the vocabulary size by frequency.
Explanation
max_features caps the vocabulary size: when set to an integer, CountVectorizer keeps only that many terms, chosen by highest total term frequency across the corpus.
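A minimal sketch of the effect, assuming scikit-learn is installed. The three-document corpus is made up for illustration and chosen so that the two most frequent terms are unambiguous ('spam' appears five times, 'offer' twice, the rest once).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; term counts: spam=5, offer=2, now=1, hello=1.
texts = ["spam spam spam offer", "spam offer now", "hello spam"]

# Without a cap, every token of two or more characters enters the vocabulary.
full = CountVectorizer().fit(texts)

# With max_features=2, only the two most frequent terms are kept.
limited = CountVectorizer(max_features=2).fit(texts)

full_vocab = sorted(full.vocabulary_)
limited_vocab = sorted(limited.vocabulary_)
print(full_vocab)
print(limited_vocab)
```

Terms dropped by the cap are simply ignored at transform time, so the resulting matrix has two columns instead of four.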
❓ Metrics
advanced
Which metric is best to evaluate an imbalanced text classification model?
You trained an NLP model to detect spam emails, but spam emails are only 5% of your data. Which metric is best to evaluate your model?
💡 Hint
Accuracy can be misleading when classes are imbalanced.
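Explanation
Precision and recall give better insight into performance on imbalanced classes because they measure false positives and false negatives on the minority class. With only 5% spam, a model that labels every email as not spam still reaches 95% accuracy while catching no spam at all.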
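The arithmetic can be checked with a small worked example in plain Python. The labels and predictions below are invented for illustration (100 emails, 5 of them spam, a model that catches 2 and raises 1 false alarm); they are not results from any real classifier.

```python
# Hypothetical ground truth: 1 = spam, 0 = not spam (5 spam out of 100).
y_true = [1] * 5 + [0] * 95
# Hypothetical predictions: 2 spam caught, 3 spam missed, 1 false alarm.
y_pred = [1, 1, 0, 0, 0] + [1] + [0] * 94

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam correctly flagged
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham wrongly flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed

accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Accuracy of a trivial model that always predicts "not spam".
baseline_accuracy = sum(1 for t in y_true if t == 0) / len(y_true)

print(f"accuracy={accuracy:.2f} baseline={baseline_accuracy:.2f}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here accuracy (0.96) is barely above the do-nothing baseline (0.95), while recall (0.40) makes it obvious that most spam is being missed.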
🔧 Debug
expert
Why does this NLP pipeline code raise a KeyError?
Consider this code snippet for text preprocessing:
    from sklearn.feature_extraction.text import CountVectorizer

    texts = ['I love AI', 'AI is fun']
    vectorizer = CountVectorizer(stop_words='english')
    X = vectorizer.fit_transform(texts)
    print(vectorizer.vocabulary_['AI'])

Why does it raise a KeyError for 'AI'?
💡 Hint
Check how CountVectorizer processes text tokens before building the vocabulary.
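Explanation
CountVectorizer lowercases text by default (lowercase=True) before tokenizing, so the vocabulary contains 'ai', not 'AI'. Looking up vocabulary_['AI'] therefore raises a KeyError. Use vocabulary_['ai'] instead, or pass lowercase=False to preserve case.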
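Both fixes can be sketched as follows, assuming scikit-learn is installed. The variable names default_vec and cased_vec are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['I love AI', 'AI is fun']

# Default behavior: text is lowercased, so the key is 'ai'.
default_vec = CountVectorizer(stop_words='english')
default_vec.fit(texts)
default_vocab = default_vec.vocabulary_
print('AI' in default_vocab, 'ai' in default_vocab)

# Alternative: keep the original casing so that 'AI' is a valid key.
cased_vec = CountVectorizer(stop_words='english', lowercase=False)
cased_vec.fit(texts)
cased_vocab = cased_vec.vocabulary_
print('AI' in cased_vocab, 'ai' in cased_vocab)

# A lookup that never raises: .get() returns None for a missing term.
print(default_vocab.get('AI'))
```

Note that 'I' never appears in either vocabulary: the default token pattern only keeps tokens of two or more characters. With lowercase=False, stop-word matching is also case-sensitive, so keeping the default and looking up the lowercase key is usually the simpler fix.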