
First NLP pipeline - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output
intermediate
What is the output of this tokenization code?
Given the following Python code using NLTK, what is the value of the tokens variable?
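import nltk
from nltk.tokenize import word_tokenize

# Fetch the Punkt tokenizer data that word_tokenize relies on
# (recent NLTK releases may ask for 'punkt_tab' instead).
nltk.download('punkt')

text = "Hello world! This is a test."
tokens = word_tokenize(text)  # tokenize the sentence
print(tokens)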
A. ['Hello world!', 'This is a test.']
B. ['Hello', 'world', '!', 'This', 'is', 'a', 'test', '.']
C. ['Hello', 'world!', 'This', 'is', 'a', 'test.']
D. ['Hello', 'world', 'This', 'is', 'a', 'test']
💡 Hint
Think about how word_tokenize splits punctuation as separate tokens.
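To explore the hint without installing NLTK, here is a minimal sketch that approximates punctuation splitting with a regular expression. The `simple_tokenize` name and the pattern are assumptions made for illustration; NLTK's `word_tokenize` applies a more elaborate rule set (contractions, abbreviations, quotes), so results can differ on other inputs.

```python
import re

def simple_tokenize(text):
    # Hypothetical helper, not part of NLTK: a run of word characters
    # becomes one token, and every other non-space character becomes
    # a token of its own.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Hello world! This is a test.")
print(tokens)
```

Comparing the printed list with the four options shows whether punctuation stays attached to the preceding word or is emitted separately.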
Model Choice
intermediate
Which model is best for sentiment analysis in an NLP pipeline?
You want to build a simple NLP pipeline to classify movie reviews as positive or negative. Which model is most suitable?
A. A convolutional neural network for image classification
B. A K-Means clustering model
C. A linear regression model
D. A pretrained BERT model fine-tuned on sentiment data
💡 Hint
Sentiment analysis is a text classification task; choose a model designed for text understanding.
Hyperparameter
advanced
Which hyperparameter affects the number of words considered in a Bag-of-Words model?
In a Bag-of-Words NLP pipeline using CountVectorizer, which hyperparameter controls the maximum number of words (features) to keep?
A. stop_words
B. min_df
C. max_features
D. ngram_range
💡 Hint
This parameter caps the vocabulary size, keeping only the terms that occur most frequently across the corpus.
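The idea of capping a vocabulary by frequency can be sketched in plain Python. This is only an imitation under stated assumptions: `build_vocabulary` and the three-sentence corpus are hypothetical, tokenization is a bare whitespace split, and the handling of ties may differ from what CountVectorizer does.

```python
from collections import Counter

def build_vocabulary(texts, max_features=None):
    # Hypothetical helper: count every lowercase, whitespace-separated
    # token across the whole corpus.
    counts = Counter(token for text in texts for token in text.lower().split())
    # Keep only the max_features most frequent terms (all terms when None).
    kept = [term for term, _ in counts.most_common(max_features)]
    # Assign column indices in alphabetical order.
    return {term: index for index, term in enumerate(sorted(kept))}

corpus = ["the cat sat", "the cat ran", "the dog barked"]  # made-up corpus
full_vocab = build_vocabulary(corpus)
small_vocab = build_vocabulary(corpus, max_features=2)
print(len(full_vocab), small_vocab)
```

With the cap set to 2, only the two most frequent terms survive, so the feature matrix would have two columns instead of six.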
Metrics
advanced
Which metric is best for evaluating a text classifier trained on imbalanced data?
You trained an NLP model to detect spam emails, but spam makes up only 5% of your data. Which metrics are most appropriate for evaluating your model?
A. Precision and Recall
B. Accuracy
C. Mean Squared Error
D. R-squared
💡 Hint
Accuracy can be misleading when classes are imbalanced.
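A small worked example makes the hint concrete. The labels below are made up to match the 5% spam rate in the problem, and `precision_recall_accuracy` is a hypothetical helper written for this sketch; returning 0.0 when a denominator is zero is a convention chosen here, not a universal rule.

```python
def precision_recall_accuracy(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives and false negatives.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    # Assumed convention: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = correct / len(pairs)
    return precision, recall, accuracy

# Made-up data: 5 spam emails (label 1) among 100 messages.
y_true = [1] * 5 + [0] * 95
# A degenerate classifier that never predicts spam.
y_pred = [0] * 100

precision, recall, accuracy = precision_recall_accuracy(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

The classifier never flags a single spam email, yet its accuracy is 95%, while recall for the spam class is zero.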
🔧 Debug
expert
Why does this NLP pipeline code raise a KeyError?
Consider this code snippet for text preprocessing:
from sklearn.feature_extraction.text import CountVectorizer
texts = ['I love AI', 'AI is fun']
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)
print(vectorizer.vocabulary_['AI'])
Why does it raise a KeyError for 'AI'?
A. 'AI' is not in the vocabulary because CountVectorizer lowercases all tokens by default
B. 'AI' is removed because it is considered a stop word in English
C. The vocabulary_ attribute is not a dictionary, causing the error
D. The fit_transform method was not called before accessing vocabulary_
💡 Hint
Check how CountVectorizer preprocesses and tokenizes text before it builds the vocabulary.
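One way to investigate is to imitate the preprocessing steps by hand and inspect the resulting keys. The sketch below is a plain-Python stand-in, not scikit-learn itself: `toy_vocabulary` is a hypothetical helper, `STOP_WORDS` is a tiny assumed stop list (the library's built-in English list is far longer), and the regular expression only resembles the default token pattern of two or more word characters.

```python
import re

# Tiny stand-in stop list, assumed for this sketch only.
STOP_WORDS = {"i", "is"}

def toy_vocabulary(texts, lowercase=True):
    terms = set()
    for text in texts:
        if lowercase:
            # Lowercasing happens before tokens are extracted.
            text = text.lower()
        # Keep tokens made of two or more word characters.
        for token in re.findall(r"\b\w\w+\b", text):
            if token not in STOP_WORDS:
                terms.add(token)
    # Map each term to a column index in alphabetical order.
    return {term: index for index, term in enumerate(sorted(terms))}

vocabulary = toy_vocabulary(["I love AI", "AI is fun"])
print(sorted(vocabulary))
print("AI" in vocabulary, "ai" in vocabulary)
```

Printing the keys before indexing into the dictionary is a quick way to see which spelling of a term actually ends up in the vocabulary.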