
Tokenization (word and sentence) in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
🎖️ Tokenization Master: get all challenges correct to earn this badge!
Problem 1: Predict Output (intermediate)
What is the output of this word tokenization code?
Given the following Python code using NLTK for word tokenization, what is the output list?
from nltk.tokenize import word_tokenize
text = "Hello world! Let's test tokenization."
tokens = word_tokenize(text)
print(tokens)
A. ['Hello', 'world', '!', "Let's", 'test', 'tokenization', '.']
B. ['Hello', 'world!', "Let's", 'test', 'tokenization.']
C. ['Hello', 'world', '!', 'Let', "'s", 'test', 'tokenization', '.']
D. ['Hello world!', "Let's test tokenization."]
💡 Hint
Remember that word_tokenize splits punctuation as separate tokens.
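The behavior the hint describes can be sketched without NLTK. The following is a rough regex-based approximation of Treebank-style word tokenization (a simplification for illustration, not NLTK's actual algorithm):

```python
import re

def approx_word_tokenize(text):
    """Rough sketch of Treebank-style word tokenization:
    split off clitics like 's and emit punctuation as separate tokens."""
    # \w+ matches runs of word characters, '\w+ catches clitics such as 's,
    # and [^\w\s] catches each punctuation mark individually
    return re.findall(r"\w+|'\w+|[^\w\s]", text)

print(approx_word_tokenize("Hello world! Let's test tokenization."))
```

Note how the contraction "Let's" splits into two tokens and each punctuation mark stands alone, which is the key behavior this problem tests.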
Problem 2: 🧠 Conceptual (intermediate)
Which option correctly describes sentence tokenization?
What does sentence tokenization do in Natural Language Processing?
A. Converts words into numerical vectors for machine learning.
B. Splits text into individual words, separating punctuation.
C. Removes stopwords from the text.
D. Splits text into sentences based on punctuation and capitalization.
💡 Hint
Think about how text is divided into meaningful chunks bigger than words.
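To make the distinction concrete, here is a minimal and deliberately naive contrast between word-level and sentence-level splitting, using only the standard library:

```python
text = "Tokenization is fun. NLP is powerful."

# word-level: split on whitespace (punctuation stays attached here)
words = text.split()

# sentence-level: split on the period terminator (naive; real sentence
# tokenizers also handle '?', '!', abbreviations, and capitalization)
sentences = [s.strip() + "." for s in text.split(".") if s.strip()]

print(words)      # six word tokens
print(sentences)  # two sentences
```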
Problem 3: Metrics (advanced)
How many sentences does this sentence tokenizer produce?
Using NLTK's sent_tokenize on the text below, how many sentences are produced? "Dr. Smith loves AI. He works at OpenAI! Do you know him?"
from nltk.tokenize import sent_tokenize
text = "Dr. Smith loves AI. He works at OpenAI! Do you know him?"
sentences = sent_tokenize(text)
print(len(sentences))
A. 3
B. 1
C. 4
D. 2
💡 Hint
Consider how abbreviations like 'Dr.' affect sentence splitting.
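A toy illustration of why abbreviation handling matters. The abbreviation list and regex below are a hand-rolled simplification of what a trained tokenizer such as NLTK's Punkt model learns from data:

```python
import re

# hypothetical mini-list of abbreviations; Punkt learns these from a corpus
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof."}

def approx_sent_tokenize(text):
    """Split on ., !, or ? followed by whitespace, unless the
    preceding word is a known abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]\s+", text):
        last_word = text[start:m.end()].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # 'Dr.' is not a sentence boundary
        sentences.append(text[start:m.start() + 1])
        start = m.end()
    if start < len(text):
        sentences.append(text[start:])  # trailing sentence
    return sentences

print(approx_sent_tokenize("Dr. Smith loves AI. He works at OpenAI! Do you know him?"))
```

Without the abbreviation check, the period in "Dr." would be misread as a sentence boundary and the count would come out one too high.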
Problem 4: 🔧 Debug (advanced)
What error does this tokenization code raise?
Consider this code snippet:

from nltk.tokenize import word_tokenize
text = None
tokens = word_tokenize(text)
print(tokens)

What error will this code raise?
A. TypeError: expected string or bytes-like object
B. NameError: name 'word_tokenize' is not defined
C. AttributeError: 'NoneType' object has no attribute 'split'
D. ValueError: empty string passed to tokenizer
💡 Hint
Check what type word_tokenize expects as input.
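This failure mode is not specific to NLTK: Python's re module, which many tokenizers are built on, raises the same kind of error when handed None instead of a string. A quick sketch using a stand-in regex tokenizer:

```python
import re

def regex_tokenize(text):
    # like word_tokenize, this expects a str; re rejects None
    return re.findall(r"\w+|[^\w\s]", text)

try:
    regex_tokenize(None)
except TypeError as exc:
    # the message mentions 'expected string or bytes-like object'
    print(f"TypeError: {exc}")
```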
Problem 5: Model Choice (expert)
Which tokenizer is best for splitting text into subword units for transformer models?
You want to prepare text input for a transformer-based language model that uses subword tokenization. Which tokenizer type should you choose?
A. Character tokenizer that splits text into individual characters
B. Byte-Pair Encoding (BPE) tokenizer that splits words into subword units
C. Whitespace tokenizer that splits text only on spaces
D. Sentence tokenizer that splits text into sentences
💡 Hint
Transformer models often use subword units to handle unknown words efficiently.
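As a sketch of the idea behind BPE merge learning: starting from characters, repeatedly fuse the most frequent adjacent symbol pair into a new subword unit. The corpus and merge count below are illustrative (a classic Sennrich-style toy example), not any particular library's implementation:

```python
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs across a space-separated symbol vocab."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[a, b] += freq
    return pairs

def apply_merge(pair, vocab):
    """Fuse every occurrence of the pair into a single merged symbol."""
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): f for word, f in vocab.items()}

# word frequencies, with each word pre-split into characters
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(2):  # learn two merges
    counts = pair_counts(vocab)
    best = max(counts, key=counts.get)  # most frequent adjacent pair
    vocab = apply_merge(best, vocab)

print(vocab)  # frequent character pairs have fused into subword units
```

After two merges the frequent suffix "est" has become a single unit, which is how BPE lets a transformer represent rare or unseen words from a fixed subword inventory.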