Tokenization Master: Challenge (5 Problems)
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
What is the output of this word tokenization code?
Given the following Python code using NLTK for word tokenization, what is the output list?
from nltk.tokenize import word_tokenize

text = "Hello world! Let's test tokenization."
tokens = word_tokenize(text)
print(tokens)
Attempts: 2 left
💡 Hint
Remember that word_tokenize splits punctuation as separate tokens.
✗ Incorrect
NLTK's word_tokenize splits words and punctuation separately. So "Let's" becomes ['Let', "'s"], and punctuation marks like '!' and '.' are separate tokens.
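The splitting behavior described above can be approximated with a small regular-expression tokenizer. This is a simplified sketch using Python's `re`, not NLTK's actual implementation (which follows the Treebank tokenizer's much richer rule set):

```python
import re

def simple_word_tokenize(text):
    # Rough approximation of word-level tokenization: keep runs of word
    # characters together, split clitics like "'s" into their own token,
    # and emit each punctuation mark separately.
    return re.findall(r"\w+|'\w+|[^\w\s]", text)

print(simple_word_tokenize("Hello world! Let's test tokenization."))
# ['Hello', 'world', '!', 'Let', "'s", 'test', 'tokenization', '.']
```

On this particular sentence the sketch matches NLTK's output, but it diverges on harder cases (hyphenation, ellipses, quotes), which is why real tokenizers carry many special-case rules.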
🧠 Conceptual
Intermediate · 1:30 remaining
Which option correctly describes sentence tokenization?
What does sentence tokenization do in Natural Language Processing?
Attempts: 2 left
💡 Hint
Think about how text is divided into meaningful chunks bigger than words.
✗ Incorrect
Sentence tokenization breaks text into sentences, usually by detecting punctuation marks like periods and capitalization patterns.
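A naive version of that idea can be sketched as a regex split after sentence-final punctuation that is followed by whitespace and a capital letter. This is illustrative only; real sentence tokenizers such as NLTK's punkt model are statistically trained:

```python
import re

def naive_sent_tokenize(text):
    # Split after ., !, or ? when followed by whitespace and an
    # uppercase letter. Naive: it misfires on abbreviations like "Dr."
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

print(naive_sent_tokenize("It rained. We stayed in! Why?"))
# ['It rained.', 'We stayed in!', 'Why?']
```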
❓ Metrics
Advanced · 1:30 remaining
How many tokens are produced by this sentence tokenizer?
Using NLTK's sent_tokenize on the text below, how many sentences are produced?
"Dr. Smith loves AI. He works at OpenAI! Do you know him?"
from nltk.tokenize import sent_tokenize

text = "Dr. Smith loves AI. He works at OpenAI! Do you know him?"
sentences = sent_tokenize(text)
print(len(sentences))
Attempts: 2 left
💡 Hint
Consider how abbreviations like 'Dr.' affect sentence splitting.
✗ Incorrect
NLTK's sent_tokenize correctly handles abbreviations and splits into 3 sentences: 'Dr. Smith loves AI.', 'He works at OpenAI!', and 'Do you know him?'.
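To see why abbreviation handling matters, consider a sketch that first splits on sentence-final punctuation and then rejoins any split that follows a known abbreviation. The hard-coded abbreviation list is an assumption for illustration; punkt learns abbreviations statistically rather than from a fixed list:

```python
import re

# Tiny illustrative abbreviation list (punkt would learn these from data).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e."}

def sent_tokenize_sketch(text):
    # Split after sentence-final punctuation, then merge any fragment
    # whose predecessor ends in a known abbreviation such as "Dr."
    parts = re.split(r"(?<=[.!?])\s+", text)
    sentences = []
    for part in parts:
        if sentences and sentences[-1].split()[-1].lower() in ABBREVIATIONS:
            sentences[-1] += " " + part
        else:
            sentences.append(part)
    return sentences

print(sent_tokenize_sketch("Dr. Smith loves AI. He works at OpenAI! Do you know him?"))
# ['Dr. Smith loves AI.', 'He works at OpenAI!', 'Do you know him?']
```

Without the rejoin step, the split after "Dr." would produce four fragments instead of the correct three sentences.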
🔧 Debug
Advanced · 1:30 remaining
What error does this tokenization code raise?
Consider this code snippet:
from nltk.tokenize import word_tokenize
text = None
tokens = word_tokenize(text)
print(tokens)
What error will this code raise?
Attempts: 2 left
💡 Hint
Check what type word_tokenize expects as input.
✗ Incorrect
word_tokenize expects a string input. Passing None causes a TypeError because it cannot process None as text.
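The failure mode can be made explicit with a type guard. `tokenize_checked` is a hypothetical helper for illustration, and the error message is our own; NLTK's actual TypeError surfaces from its internals with different wording:

```python
def tokenize_checked(text):
    # Tokenizers expect a str; fail fast with a clear TypeError
    # instead of letting None propagate into regex internals.
    if not isinstance(text, str):
        raise TypeError(f"expected a string, got {type(text).__name__}")
    return text.split()  # whitespace split stands in for word_tokenize

try:
    tokenize_checked(None)
except TypeError as exc:
    print(exc)  # expected a string, got NoneType
```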
❓ Model Choice
Expert · 2:00 remaining
Which tokenizer is best for splitting text into subword units for transformer models?
You want to prepare text input for a transformer-based language model that uses subword tokenization. Which tokenizer type should you choose?
Attempts: 2 left
💡 Hint
Transformer models often use subword units to handle unknown words efficiently.
✗ Incorrect
Byte-Pair Encoding (BPE) tokenizers split words into subword units, which helps transformer models handle rare or unknown words better than word or character tokenizers.
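The core of BPE training is small enough to sketch: count adjacent symbol pairs across a frequency-weighted vocabulary, then greedily merge the most frequent pair. This is a minimal illustration of the algorithm, not a production tokenizer like those in Hugging Face's tokenizers library (words are given as space-separated symbols):

```python
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Rewrite every word, fusing each occurrence of the chosen pair.
    new_vocab = {}
    for word, freq in vocab.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_vocab[" ".join(out)] = freq
    return new_vocab

def learn_bpe(vocab, num_merges):
    # Greedily merge the most frequent adjacent pair, num_merges times.
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return vocab, merges

vocab = {"a b c": 3, "a b d": 2}
new_vocab, merges = learn_bpe(vocab, 1)
print(merges)     # [('a', 'b')]
print(new_vocab)  # {'ab c': 3, 'ab d': 2}
```

Each recorded merge becomes a rule applied in order at encoding time, so frequent fragments end up as single tokens while rare or unknown words fall back to smaller pieces, which is exactly what makes subword tokenizers robust for transformer inputs.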