Text preprocessing (tokenization, stemming, lemmatization) in ML Python - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output
intermediate
Output of Tokenization Code
What is the output of the following Python code using NLTK's word_tokenize function?
ML Python
from nltk.tokenize import word_tokenize
text = "Hello there! How's it going?"
tokens = word_tokenize(text)
print(tokens)
A. ['Hello', 'there', '!', 'How', "'s", 'it', 'going', '?']
B. ['Hello', 'there', '!', 'How', ''s', 'it', 'going', '?']
C. ['Hello', 'there', '!', "How's", 'it', 'going', '?']
D. ['Hello', 'there', '!', 'How', 's', 'it', 'going', '?']
💡 Hint
Remember how word_tokenize handles contractions and punctuation.
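As a rough illustration of the behaviour the hint points at, the sketch below splits punctuation and common English clitics such as 's into their own tokens. It uses only the standard library. The name simple_tokenize and the regular expression are assumptions made for this example; this is not how NLTK's word_tokenize is implemented, and it will not match NLTK on every input.

```python
import re

# Hypothetical pattern for illustration only (not NLTK's rules):
#   1. a clitic such as 's, 're, 've, 'll, 'd, 'm
#   2. a run of word characters
#   3. any single non-space, non-word character (punctuation)
TOKEN_PATTERN = re.compile(r"'(?:s|re|ve|ll|d|m)\b|\w+|[^\w\s]")


def simple_tokenize(text):
    """Return a list of tokens, keeping clitics and punctuation separate."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tokenize("It's late. Let's go!")
print(tokens)
```

The clitic alternative is listed first so that the apostrophe stays attached to the letters that follow it instead of being emitted as a lone punctuation mark.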
🧠 Conceptual
intermediate
Difference Between Stemming and Lemmatization
Which statement correctly describes the difference between stemming and lemmatization?
A. Stemming removes suffixes to get root forms, while lemmatization returns dictionary base forms considering context.
B. Lemmatization removes suffixes to get root forms, while stemming returns dictionary base forms considering context.
C. Both stemming and lemmatization always return the same root word without context.
D. Stemming and lemmatization are identical processes with different names.
💡 Hint
Think about whether the process uses a dictionary or just cuts word endings.
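The contrast in the hint can be sketched with two toy functions: one that only cuts word endings, and one that looks words up in a dictionary. Both naive_stem and dictionary_lemma are hypothetical helpers written for this example, and the suffix list and TOY_LEMMAS table are made-up assumptions. Real stemmers (such as Porter) and real lemmatizers (such as WordNet-based ones) are considerably more elaborate.

```python
# Assumed suffix list for the toy stemmer; checked in this order.
SUFFIXES = ("ing", "ies", "ed", "es", "s")


def naive_stem(word):
    """Cut the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical mini-dictionary mapping inflected forms to base forms.
TOY_LEMMAS = {"studies": "study", "mice": "mouse", "better": "good"}


def dictionary_lemma(word):
    """Return the dictionary base form, or the word itself if unknown."""
    return TOY_LEMMAS.get(word, word)


for word in ("studies", "mice", "better"):
    print(word, "stem:", naive_stem(word), "lemma:", dictionary_lemma(word))
```

The suffix cutter can produce strings that are not words and cannot handle irregular forms, while the lookup returns valid base forms but only for entries it knows.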
Metrics
advanced
Evaluating Tokenization Quality
You have two tokenization methods applied to the sentence: "Cats running faster than dogs." Method A produces 6 tokens, Method B produces 5 tokens. Which metric best helps decide which tokenization is better?
A. Use the length of the longest token as the metric.
B. Use precision and recall comparing tokens to a gold standard.
C. Use model training accuracy directly without token evaluation.
D. Token count alone is enough to decide quality.
💡 Hint
Think about how to compare token lists to a correct reference.
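One way to compare a token list to a reference is to map each token to its character span and count the spans both lists share. The sketch below does this for the sentence in the question. The gold tokenization and the concrete outputs of Method A and Method B are assumptions invented for this example (the question gives only the token counts), and token_spans and precision_recall are hypothetical helper names.

```python
def token_spans(text, tokens):
    """Map each token to its (start, end) character span in text."""
    spans = set()
    position = 0
    for token in tokens:
        start = text.index(token, position)  # raises ValueError if missing
        end = start + len(token)
        spans.add((start, end))
        position = end
    return spans


def precision_recall(text, predicted, gold):
    """Span-level precision and recall of predicted tokens against gold."""
    predicted_spans = token_spans(text, predicted)
    gold_spans = token_spans(text, gold)
    matched = len(predicted_spans & gold_spans)
    precision = matched / len(predicted_spans) if predicted_spans else 0.0
    recall = matched / len(gold_spans) if gold_spans else 0.0
    return precision, recall


text = "Cats running faster than dogs."
# Assumed gold standard: the final period is its own token.
gold = ["Cats", "running", "faster", "than", "dogs", "."]
method_a = ["Cats", "running", "faster", "than", "dogs", "."]  # 6 tokens
method_b = ["Cats", "running", "faster", "than", "dogs."]      # 5 tokens

print("Method A:", precision_recall(text, method_a, gold))
print("Method B:", precision_recall(text, method_b, gold))
```

Under these assumptions Method B matches four of its five spans and recovers four of the six gold spans, which the raw token counts alone would not reveal.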
🔧 Debug
advanced
Error in Lemmatization Code
What happens when the following code runs? Assume the WordNet corpus is already downloaded.
ML Python
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
word = 'running'
print(lemmatizer.lemmatize(word))
A. TypeError, because lemmatize requires a list, not a string.
B. NameError, because WordNetLemmatizer is not imported.
C. SyntaxError, due to missing parentheses in the print statement.
D. No error: it prints 'running', because the default pos is 'n' (noun) and the word is left unchanged.
💡 Hint
Check the default part of speech used by lemmatizer.
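The effect of a default part of speech can be sketched with a toy lookup keyed on (word, pos). The function toy_lemmatize and the TOY_WORDNET table are hypothetical stand-ins that mirror only the idea of a pos parameter defaulting to "n"; they are not NLTK's implementation. With NLTK itself, passing pos="v" to WordNetLemmatizer.lemmatize is the usual way to get 'run' from 'running'.

```python
# Hypothetical lookup table keyed on (word, part of speech).
TOY_WORDNET = {
    ("running", "v"): "run",
    ("geese", "n"): "goose",
}


def toy_lemmatize(word, pos="n"):
    """Return the base form for (word, pos); fall back to the word itself."""
    return TOY_WORDNET.get((word, pos), word)


print(toy_lemmatize("running"))           # looked up as a noun
print(toy_lemmatize("running", pos="v"))  # looked up as a verb
print(toy_lemmatize("geese"))
```

Because the lookup is keyed on the part of speech, the same surface form can stay unchanged under one tag and be reduced under another.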
Model Choice
expert
Choosing Preprocessing for Sentiment Analysis
You want to build a sentiment analysis model on social media text that contains a lot of slang and many misspellings. Which combination of preprocessing steps is best?
A. Use only tokenization, without stemming or lemmatization, to preserve slang.
B. Use stemming to reduce words to roots, ignoring slang and misspellings.
C. Use lemmatization with POS tagging to normalize words, plus a custom slang dictionary.
D. Only remove all punctuation and lowercase the text, with no tokenization.
💡 Hint
Consider how to handle slang and misspellings while normalizing words.
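A slang-handling step can be sketched as a dictionary lookup applied after tokenization and before any stemming or lemmatization. The SLANG table, the tokenizing regular expression, and the normalize function below are all assumptions made for this example; a real system would need a much larger dictionary and some strategy for misspellings, which this sketch does not attempt.

```python
import re

# Hypothetical slang dictionary; entries are illustrative only.
SLANG = {"u": "you", "r": "are", "gr8": "great", "luv": "love", "thx": "thanks"}


def normalize(text):
    """Lowercase, tokenize on words and punctuation, then expand slang."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return [SLANG.get(token, token) for token in tokens]


normalized = normalize("Luv this phone, thx! U r gr8")
print(normalized)
```

Punctuation is kept as separate tokens here because marks such as "!" can carry sentiment; dropping them is a design choice to evaluate rather than assume.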