Challenge - 5 Problems

🎖️

Text Preprocessing Master

Get all challenges correct to earn this badge!

❓ Predict Output

intermediate

Output of Tokenization Code

What is the output of the following Python code using NLTK's word_tokenize function?

ML Python

from nltk.tokenize import word_tokenize
text = "Hello there! How's it going?"
tokens = word_tokenize(text)
print(tokens)

A. ['Hello', 'there', '!', 'How', "'s", 'it', 'going', '?']

B. ['Hello', 'there', '!', 'How', ''s', 'it', 'going', '?']

C. ['Hello', 'there', '!', 'How's', 'it', 'going', '?']

D. ['Hello', 'there', '!', 'How', 's', 'it', 'going', '?']

🧠 Conceptual

intermediate

Difference Between Stemming and Lemmatization

Which statement correctly describes the difference between stemming and lemmatization?

A. Stemming removes suffixes to get root forms, while lemmatization returns dictionary base forms considering context.

B. Lemmatization removes suffixes to get root forms, while stemming returns dictionary base forms considering context.

C. Both stemming and lemmatization always return the same root word without context.

D. Stemming and lemmatization are identical processes with different names.
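
To make the two terms in this question concrete, here is a minimal toy sketch in plain Python. The suffix list and the tiny lemma table are hypothetical and exist only for illustration; real tools such as NLTK's PorterStemmer and WordNetLemmatizer rely on much richer rules and lexicons.

```python
# Toy illustration only: the suffix list and lemma table are hypothetical.

SUFFIXES = ("ing", "ies", "ed", "es", "s")  # checked in this order

# Hypothetical mini-lexicon keyed by (word, part of speech).
LEMMA_TABLE = {
    ("studies", "noun"): "study",
    ("studies", "verb"): "study",
    ("better", "adj"): "good",
    ("meeting", "noun"): "meeting",
    ("meeting", "verb"): "meet",
}


def toy_stem(word):
    """Strip the first matching suffix; the result may not be a real word."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word, pos):
    """Look up the dictionary form of a word for a given part of speech."""
    return LEMMA_TABLE.get((word, pos), word)


examples = [("studies", "verb"), ("better", "adj"),
            ("meeting", "noun"), ("meeting", "verb")]
for word, pos in examples:
    print(word, pos, "stem:", toy_stem(word), "lemma:", toy_lemmatize(word, pos))
```

In this sketch the stem of "studies" is the non-word "stud", while the lookup returns "study"; "meeting" maps to a different lemma depending on the part of speech supplied.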

❓ Metrics

advanced

Evaluating Tokenization Quality

You have two tokenization methods applied to the sentence: "Cats running faster than dogs." Method A produces 6 tokens, Method B produces 5 tokens. Which metric best helps decide which tokenization is better?

A. Use the length of the longest token as the metric.

B. Use precision and recall comparing tokens to a gold standard.

C. Use model training accuracy directly without token evaluation.

D. Token count alone is enough to decide quality.
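
For readers unfamiliar with scoring a tokenizer against a reference, the sketch below shows one way token-level precision and recall could be computed by comparing character spans. The question does not list the actual tokens, so the gold standard and the outputs of Method A and Method B here are assumptions chosen to match the stated counts of 6 and 5; the helper names are hypothetical as well.

```python
def token_spans(text, tokens):
    """Map each token to its (start, end) character span in the text."""
    spans, cursor = set(), 0
    for token in tokens:
        start = text.index(token, cursor)  # ValueError if the token is missing
        cursor = start + len(token)
        spans.add((start, cursor))
    return spans


def precision_recall(text, predicted, gold):
    """Score predicted tokens against gold tokens by exact span match."""
    pred_spans = token_spans(text, predicted)
    gold_spans = token_spans(text, gold)
    matched = len(pred_spans & gold_spans)
    precision = matched / len(pred_spans) if pred_spans else 0.0
    recall = matched / len(gold_spans) if gold_spans else 0.0
    return precision, recall


text = "Cats running faster than dogs."
# Assumed reference tokenization and assumed method outputs.
gold = ["Cats", "running", "faster", "than", "dogs", "."]
method_a = ["Cats", "running", "faster", "than", "dogs", "."]
method_b = ["Cats", "running", "faster", "than", "dogs."]

print("Method A:", precision_recall(text, method_a, gold))
print("Method B:", precision_recall(text, method_b, gold))
```

Under these assumptions Method A matches every gold span, while Method B keeps "dogs." as one token and so matches only four of the six gold spans.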

🔧 Debug

advanced

Error in Lemmatization Code

What error, if any, does this code raise?

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
word = 'running'
print(lemmatizer.lemmatize(word))

A. TypeError because lemmatize requires a list, not a string.

B. NameError because WordNetLemmatizer is not imported.

C. SyntaxError due to missing parentheses in the print statement.

D. No error: it prints 'running' because the default pos is 'n' (noun), so the word is left unchanged.

❓ Model Choice

expert

Choosing Preprocessing for Sentiment Analysis

You want to build a sentiment analysis model on social media text that contains a lot of slang and many misspellings. Which combination of preprocessing steps is best?

A. Use only tokenization without stemming or lemmatization to preserve slang.

B. Use stemming to reduce words to roots, ignoring slang and misspellings.

C. Use lemmatization with POS tagging to normalize words, plus custom slang dictionary.

D. Remove all punctuation and lowercase text only, no tokenization.
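
As background on what a custom slang dictionary might look like in practice, here is a minimal sketch using only the standard library. The mapping, the regular expression, and the function name are hypothetical; lemmatization and POS tagging are deliberately left out to keep the example self-contained.

```python
import re

# Hypothetical slang and misspelling map; a real one would be far larger
# and built from the target platform's own data.
SLANG_MAP = {
    "gr8": "great",
    "luv": "love",
    "u": "you",
    "rly": "really",
    "definately": "definitely",
}

# Words (letters, digits, apostrophes) or single punctuation characters.
TOKEN_PATTERN = re.compile(r"[a-z0-9']+|[^\sa-z0-9']")


def normalize(text):
    """Lowercase, split into word and punctuation tokens, then map slang."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [SLANG_MAP.get(token, token) for token in tokens]


print(normalize("Luv this phone, it's rly gr8!!"))
```

Punctuation such as repeated exclamation marks is kept as separate tokens here, since it can carry sentiment; whether to keep it is a design choice that depends on the downstream model.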
