Challenge - 5 Problems
Text Preprocessing Master
❓ Predict Output
Intermediate
Output of Tokenization Code
What is the output of the following Python code using NLTK's word_tokenize function?
ML Python
from nltk.tokenize import word_tokenize

text = "Hello there! How's it going?"
tokens = word_tokenize(text)
print(tokens)
💡 Hint
Remember how word_tokenize handles contractions and punctuation.
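Explanation
NLTK's word_tokenize splits contractions such as "How's" into ['How', "'s"], keeping the apostrophe with the clitic, and separates punctuation into its own tokens. The code prints ['Hello', 'there', '!', 'How', "'s", 'it', 'going', '?'].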
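The splitting behavior described above can be approximated with a short regular expression. This is only a sketch for intuition: `approx_tokenize` is a hypothetical helper, not NLTK's algorithm, and it does not reproduce Treebank-style handling of forms such as "don't".

```python
import re

def approx_tokenize(text):
    """Rough stand-in for contraction and punctuation splitting (not NLTK)."""
    # Order matters: try an apostrophe clitic ('s, 're) first, then a word,
    # then any single non-space, non-word character (punctuation).
    return re.findall(r"'\w+|\w+|[^\w\s]", text)

print(approx_tokenize("Hello there! How's it going?"))
# → ['Hello', 'there', '!', 'How', "'s", 'it', 'going', '?']
```

For this sentence the sketch yields the same tokens as the explanation above; on other inputs the two can differ.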
🧠 Conceptual
Intermediate
Difference Between Stemming and Lemmatization
Which statement correctly describes the difference between stemming and lemmatization?
💡 Hint
Think about whether the process uses a dictionary or just cuts word endings.
Explanation
Stemming is a crude process that chops off word endings, often producing non-words. Lemmatization uses vocabulary and morphological analysis to return the base or dictionary form of a word.
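The contrast can be illustrated with a toy sketch. Both `crude_stem` and `toy_lemmatize` are hypothetical helpers written for this example, and the suffix list and lemma table are made-up assumptions; real stemmers and lemmatizers are considerably more elaborate.

```python
# Toy suffix list and lemma table, invented for illustration only.
SUFFIXES = ("ing", "ies", "ed", "es", "s")
TOY_LEMMAS = {"studies": "study", "running": "run", "better": "good"}

def crude_stem(word):
    """Chop the first matching suffix, with no knowledge of the vocabulary."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary of known base forms."""
    return TOY_LEMMAS.get(word, word)

for w in ("studies", "running", "better"):
    print(w, crude_stem(w), toy_lemmatize(w))
# studies stud study
# running runn run
# better better good
```

The stemmer produces the non-word "runn" and cannot relate "better" to "good", while the dictionary lookup returns real base forms for the words it knows.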
❓ Metrics
Advanced
Evaluating Tokenization Quality
You have two tokenization methods applied to the sentence: "Cats running faster than dogs." Method A produces 6 tokens, Method B produces 5 tokens. Which metric best helps decide which tokenization is better?
💡 Hint
Think about how to compare token lists to a correct reference.
Explanation
Precision and recall compare the tokens produced by a method to a gold standard set of tokens, measuring how many tokens are correctly identified and how many are missed or extra.
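One way to compute these scores is to compare character spans rather than raw strings, so that repeated words are matched by position. The sketch below assumes a gold tokenization of six tokens and a hypothetical Method B that leaves the final period attached to "dogs"; neither token list comes from the original question.

```python
def token_spans(text, tokens):
    """Map each token to its (start, end) character span in the text."""
    spans, pos = set(), 0
    for tok in tokens:
        start = text.index(tok, pos)  # search from the end of the previous token
        pos = start + len(tok)
        spans.add((start, pos))
    return spans

def precision_recall(text, predicted, gold):
    """Span-level precision and recall of a tokenization against a reference."""
    pred_spans = token_spans(text, predicted)
    gold_spans = token_spans(text, gold)
    correct = len(pred_spans & gold_spans)
    return correct / len(pred_spans), correct / len(gold_spans)

text = "Cats running faster than dogs."
gold = ["Cats", "running", "faster", "than", "dogs", "."]   # assumed reference
method_a = ["Cats", "running", "faster", "than", "dogs", "."]
method_b = ["Cats", "running", "faster", "than", "dogs."]   # assumed output

print(precision_recall(text, method_a, gold))  # (1.0, 1.0)
print(precision_recall(text, method_b, gold))  # precision 0.8, recall 4/6
```

Under these assumptions Method B gets four of its five tokens right and recovers four of the six gold tokens, which a raw token count alone would not reveal.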
🔧 Debug
Advanced
Error in Lemmatization Code
What error does this code raise?
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
word = 'running'
print(lemmatizer.lemmatize(word))
💡 Hint
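Check the default part of speech used by the lemmatizer.
Explanation
The code raises no error, provided the WordNet corpus is installed. The lemmatizer defaults to pos='n' (noun), so 'running' is treated as a noun and returned unchanged, and the code prints running. To get 'run', specify the verb part of speech: lemmatizer.lemmatize(word, pos='v').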
❓ Model Choice
Expert
Choosing Preprocessing for Sentiment Analysis
You want to build a sentiment analysis model on social media text that contains a lot of slang and many misspellings. Which combination of preprocessing steps is best?
💡 Hint
Consider how to handle slang and misspellings while normalizing words.
Explanation
Lemmatization with POS tagging normalizes words properly, and a custom slang dictionary helps handle informal language common in social media, improving model input quality.
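The slang-handling step might look like the minimal sketch below. The `SLANG` table and the `normalize_slang` helper are hypothetical and written for illustration; a real system would use a much larger dictionary and would pass the result on to POS tagging and lemmatization, which are omitted here.

```python
import re

# Tiny made-up slang table; a real one would be far larger.
SLANG = {"gr8": "great", "luv": "love", "u": "you", "idk": "i do not know"}

def normalize_slang(text, slang=SLANG):
    """Lowercase, split into words and punctuation, and expand known slang."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    normalized = []
    for tok in tokens:
        # A slang entry may expand to several words, so split the replacement.
        normalized.extend(slang.get(tok, tok).split())
    return normalized

print(normalize_slang("U will luv this phone, gr8 battery!"))
# → ['you', 'will', 'love', 'this', 'phone', ',', 'great', 'battery', '!']
```

Expanding slang before lemmatization means the lemmatizer sees standard word forms it can actually look up.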