Challenge - 5 Problems
Tokenization Master
❓ Predict Output
Intermediate: What is the output of this word tokenization code?
Given the following Python code using NLTK for word tokenization, what is the output list?
from nltk.tokenize import word_tokenize
text = "Hello world! Let's test tokenization."
tokens = word_tokenize(text)
print(tokens)
💡 Hint
Remember that word_tokenize splits punctuation as separate tokens.
Explanation
NLTK's word_tokenize splits words and punctuation separately. So "Let's" becomes ['Let', "'s"], and punctuation marks like '!' and '.' are separate tokens.
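The splitting behaviour described above can be approximated with a short regular expression. This is only a rough sketch using the standard library, not NLTK's actual algorithm: the pattern and the name simple_word_tokenize are assumptions made for illustration, and contractions such as "don't" would come out differently from NLTK's result.

```python
import re

def simple_word_tokenize(text):
    """Hypothetical stand-in for word_tokenize (illustration only).

    The pattern tries, in order:
      '\\w+     a clitic such as 's
      \\w+      a run of word characters
      [^\\w\\s]  any single punctuation character
    """
    return re.findall(r"'\w+|\w+|[^\w\s]", text)

text = "Hello world! Let's test tokenization."
tokens = simple_word_tokenize(text)
print(tokens)
# ['Hello', 'world', '!', 'Let', "'s", 'test', 'tokenization', '.']
```

For this particular sentence the sketch produces the same eight tokens that the explanation describes, with '!' and '.' as separate tokens and "Let's" split in two.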
🧠 Conceptual
Intermediate: Which option correctly describes sentence tokenization?
What does sentence tokenization do in Natural Language Processing?
💡 Hint
Think about how text is divided into meaningful chunks bigger than words.
Explanation
Sentence tokenization splits text into individual sentences, typically by combining end-of-sentence punctuation (periods, question marks, exclamation points) with cues such as capitalization of the following word.
❓ Metrics
Advanced: How many sentences are produced by this sentence tokenizer?
Using NLTK's sent_tokenize on the text below, how many sentences are produced?
"Dr. Smith loves AI. He works at OpenAI! Do you know him?"
from nltk.tokenize import sent_tokenize
text = "Dr. Smith loves AI. He works at OpenAI! Do you know him?"
sentences = sent_tokenize(text)
print(len(sentences))
💡 Hint
Consider how abbreviations like 'Dr.' affect sentence splitting.
Explanation
NLTK's sent_tokenize uses the pretrained Punkt model, which treats 'Dr.' as an abbreviation rather than a sentence boundary. The text therefore splits into 3 sentences: 'Dr. Smith loves AI.', 'He works at OpenAI!', and 'Do you know him?'.
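To see why abbreviation handling matters, here is a minimal sketch that compares a naive punctuation-based splitter with one that consults a small abbreviation list. Both functions and the ABBREVIATIONS set are hypothetical and hand-written for illustration; Punkt itself learns abbreviation statistics from data rather than using a fixed list like this.

```python
import re

# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof."}

def naive_split(text):
    # Split on whitespace that follows '.', '!' or '?'.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def abbreviation_aware_split(text):
    # Close a sentence only when a word ends in '.', '!' or '?'
    # and is not a known abbreviation.
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if word[-1] in ".!?" and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

text = "Dr. Smith loves AI. He works at OpenAI! Do you know him?"
naive = naive_split(text)
aware = abbreviation_aware_split(text)
print(len(naive), len(aware))  # 4 3
```

The naive version counts 'Dr.' as its own sentence and reports 4, while the abbreviation-aware version reports 3, matching the count given in the explanation.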
🔧 Debug
Advanced: What error does this tokenization code raise?
Consider this code snippet:
from nltk.tokenize import word_tokenize
text = None
tokens = word_tokenize(text)
print(tokens)
What error will this code raise?
💡 Hint
Check what type word_tokenize expects as input.
Explanation
word_tokenize expects a string. Passing None raises a TypeError because the tokenizer's internal pattern matching cannot operate on a non-string value.
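One way to avoid this error is to validate the input before tokenizing. The sketch below is an assumption about how such a guard might look, not part of NLTK: safe_tokenize is a hypothetical helper, and str.split stands in for the real tokenizer so that the example runs without extra dependencies. In practice you might pass word_tokenize as the tokenizer argument.

```python
def safe_tokenize(text, tokenizer=str.split):
    """Hypothetical guard around a tokenizer function (illustration only)."""
    if text is None:
        # Treat missing text as "no tokens" instead of raising.
        return []
    if not isinstance(text, str):
        raise TypeError(f"expected str, got {type(text).__name__}")
    return tokenizer(text)

print(safe_tokenize(None))           # []
print(safe_tokenize("Hello world"))  # ['Hello', 'world']
```

Whether None should map to an empty list or raise a clearer error is a design choice; returning an empty list is convenient when processing datasets with missing fields.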
❓ Model Choice
Expert: Which tokenizer is best for splitting text into subword units for transformer models?
You want to prepare text input for a transformer-based language model that uses subword tokenization. Which tokenizer type should you choose?
💡 Hint
Transformer models often use subword units to handle unknown words efficiently.
Explanation
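Byte-Pair Encoding (BPE) tokenizers split words into subword units. This lets transformer models represent rare or unseen words as combinations of known pieces, avoiding both the out-of-vocabulary problem of word tokenizers and the long sequences produced by character tokenizers.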
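The core idea of BPE can be sketched in a few lines: start from characters and repeatedly merge the most frequent adjacent pair. This is a toy version under simplifying assumptions (no end-of-word marker, no byte-level handling, a made-up four-word corpus), and the function names are hypothetical. Production tokenizers differ in many details.

```python
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of `pair` with a single merged symbol.
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    # Start from single characters and learn an ordered list of merges.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

def apply_bpe(word, merges):
    # Segment a (possibly unseen) word by replaying merges in order.
    symbols = list(word)
    for pair in merges:
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == pair:
                symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
            else:
                i += 1
    return symbols

# Made-up word frequencies, for illustration only.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(word_freqs, num_merges=5)
pieces = apply_bpe("lowest", merges)
print(merges)
print(pieces)
```

The word "lowest" never appears in the toy corpus, yet it can still be segmented into pieces learned from the other words, which is the property the explanation above refers to.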
