
Tokenization (word and sentence) in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output
intermediate
What is the output of this word tokenization code?
Given the following Python code using NLTK for word tokenization, what is the output list?
from nltk.tokenize import word_tokenize
text = "Hello world! Let's test tokenization."
tokens = word_tokenize(text)
print(tokens)
A. ['Hello', 'world', '!', "Let's", 'test', 'tokenization', '.']
B. ['Hello', 'world!', "Let's", 'test', 'tokenization.']
C. ['Hello', 'world', '!', 'Let', "'s", 'test', 'tokenization', '.']
D. ['Hello world!', "Let's test tokenization."]
💡 Hint
Remember that word_tokenize splits punctuation as separate tokens.
🧠 Conceptual
intermediate
Which option correctly describes sentence tokenization?
What does sentence tokenization do in Natural Language Processing?
A. Converts words into numerical vectors for machine learning.
B. Splits text into individual words, separating punctuation.
C. Removes stopwords from the text.
D. Splits text into sentences based on punctuation and capitalization.
💡 Hint
Think about how text is divided into meaningful chunks bigger than words.
Metrics
advanced
How many sentences does this sentence tokenizer produce?
Using NLTK's sent_tokenize on the text below, how many sentences are produced? "Dr. Smith loves AI. He works at OpenAI! Do you know him?"
from nltk.tokenize import sent_tokenize
text = "Dr. Smith loves AI. He works at OpenAI! Do you know him?"
sentences = sent_tokenize(text)
print(len(sentences))
A. 3
B. 1
C. 4
D. 2
💡 Hint
Consider how abbreviations like 'Dr.' affect sentence splitting.
🔧 Debug
advanced
What error does this tokenization code raise?
Consider this code snippet:
from nltk.tokenize import word_tokenize
text = None
tokens = word_tokenize(text)
print(tokens)
What error will this code raise?
A. TypeError: expected string or bytes-like object
B. NameError: name 'word_tokenize' is not defined
C. AttributeError: 'NoneType' object has no attribute 'split'
D. ValueError: empty string passed to tokenizer
💡 Hint
Check what type word_tokenize expects as input.
Model Choice
expert
Which tokenizer is best for splitting text into subword units for transformer models?
You want to prepare text input for a transformer-based language model that uses subword tokenization. Which tokenizer type should you choose?
A. Character tokenizer that splits text into individual characters
B. Byte-Pair Encoding (BPE) tokenizer that splits words into subword units
C. Whitespace tokenizer that splits text only on spaces
D. Sentence tokenizer that splits text into sentences
💡 Hint
Transformer models often use subword units to handle unknown words efficiently.

Practice

1. What is the main purpose of tokenization in natural language processing?
easy
A. To remove stop words from text
B. To translate text into another language
C. To split text into smaller units like words or sentences
D. To generate new sentences from text

Solution

  1. Step 1: Understand tokenization

    Tokenization means breaking text into smaller pieces such as words or sentences.
  2. Step 2: Identify the main goal

    The main goal is to prepare text for further processing by splitting it into tokens.
  3. Final Answer:

    To split text into smaller units like words or sentences -> Option C
  4. Quick Check:

    Tokenization = splitting text [OK]
Hint: Tokenization means cutting text into pieces [OK]
Common Mistakes:
  • Confusing tokenization with translation
  • Thinking tokenization removes words
  • Believing tokenization generates new text
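To make "splitting text into smaller units" concrete, here is a minimal sketch that uses only Python's standard library. The helper name simple_word_tokens and the regular expression are assumptions chosen for illustration. They are a rough approximation and not the algorithm NLTK uses.

```python
import re
from typing import List

def simple_word_tokens(text: str) -> List[str]:
    """Illustrative helper: split text into words and single punctuation marks."""
    # \w+      matches a run of letters, digits, or underscores (a "word").
    # [^\w\s]  matches one character that is neither a word character
    #          nor whitespace (a punctuation mark).
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_word_tokens("Tokenization splits text."))
# ['Tokenization', 'splits', 'text', '.']
```

Each token keeps its original order, so later steps such as counting or filtering can work on the list directly.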
2. Which of the following Python code snippets correctly tokenizes a sentence into words using NLTK?
easy
A. from nltk.tokenize import word_tokenize; sentence = 'Hello world!'; tokens = word_tokenize(sentence)
B. import nltk; sentence = 'Hello world!'; tokens = nltk.split(sentence)
C. from nltk.tokenize import sent_tokenize; sentence = 'Hello world!'; tokens = sent_tokenize(sentence)
D. sentence = 'Hello world!'; tokens = sentence.split_words()

Solution

  1. Step 1: Check correct import and function

    The correct function to tokenize words in NLTK is word_tokenize from nltk.tokenize.
  2. Step 2: Verify code correctness

    The snippet that imports word_tokenize from nltk.tokenize and calls word_tokenize(sentence) uses the right import and applies it correctly to the sentence.
  3. Final Answer:

    from nltk.tokenize import word_tokenize; sentence = 'Hello world!'; tokens = word_tokenize(sentence) -> Option A
  4. Quick Check:

    Use word_tokenize for word splitting [OK]
Hint: Use word_tokenize from nltk.tokenize for words [OK]
Common Mistakes:
  • Using sent_tokenize for word tokenization
  • Calling non-existent split_words() method
  • Using nltk.split which does not exist
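A runnable version of the correct snippet might look like the sketch below. The wrapper tokenize_words and its regex fallback are assumptions added so the example still runs when NLTK or its Punkt tokenizer data is not installed. The fallback is only a rough stand-in for word_tokenize.

```python
import re

def tokenize_words(sentence):
    """Illustrative wrapper: use NLTK when it is usable, else a rough regex."""
    try:
        from nltk.tokenize import word_tokenize
        return word_tokenize(sentence)
    except (ImportError, LookupError):
        # ImportError: NLTK is not installed.
        # LookupError: NLTK is installed but its tokenizer data is missing.
        return re.findall(r"\w+|[^\w\s]", sentence)

tokens = tokenize_words("Hello world!")
print(tokens)  # ['Hello', 'world', '!']
```

If NLTK reports a LookupError, its message names the data package to fetch with nltk.download.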
3. What will be the output of this Python code using NLTK?
from nltk.tokenize import sent_tokenize
text = 'Hello world! How are you?'
sentences = sent_tokenize(text)
print(sentences)
medium
A. ['Hello world!', 'How are you?']
B. ['Hello world! How are you?']
C. ['Hello', 'world!', 'How', 'are', 'you?']
D. ['Hello world', 'How are you']

Solution

  1. Step 1: Understand sent_tokenize function

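    sent_tokenize splits text into sentences at sentence-ending punctuation, using NLTK's pretrained Punkt model to decide which punctuation marks actually end a sentence.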
  2. Step 2: Apply sent_tokenize to the text

    The text has two sentences: 'Hello world!' and 'How are you?'.
  3. Final Answer:

    ['Hello world!', 'How are you?'] -> Option A
  4. Quick Check:

    sent_tokenize splits sentences correctly [OK]
Hint: sent_tokenize splits text at sentence ends [OK]
Common Mistakes:
  • Confusing sent_tokenize with word_tokenize output
  • Expecting no split for multiple sentences
  • Ignoring punctuation as sentence boundary
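To see why sentence splitting needs more than a punctuation rule, consider this naive splitter. It is a sketch under stated assumptions: the function name naive_sentences and the regex are illustrative and are not how sent_tokenize works, since NLTK relies on a trained Punkt model instead.

```python
import re

def naive_sentences(text):
    """Illustrative helper: split on whitespace that follows '.', '!' or '?'."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(naive_sentences("Hello world! How are you?"))
# ['Hello world!', 'How are you?']

print(naive_sentences("Dr. Smith loves AI."))
# ['Dr.', 'Smith loves AI.']  (the abbreviation is wrongly treated as a sentence end)
```

The second call shows the kind of case a purely punctuation-based rule gets wrong, which is why abbreviation handling matters.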
4. Identify the error in this code snippet for word tokenization using NLTK:
import nltk
tokens = word_tokenize('Hello world!')
medium
A. The string should be a list, not a plain string
B. word_tokenize requires a second argument naming the language
C. word_tokenize does not exist in NLTK
D. Missing import of word_tokenize from nltk.tokenize

Solution

  1. Step 1: Check how word_tokenize is referenced

    The snippet imports only the nltk package and then calls word_tokenize as a bare name. No such name is in scope, so Python raises NameError.
  2. Step 2: Identify correct import

    Import the function explicitly: from nltk.tokenize import word_tokenize. The qualified call nltk.word_tokenize('Hello world!') also works, because NLTK re-exports the function at the package level.
  3. Final Answer:

    Missing import of word_tokenize from nltk.tokenize -> Option D
  4. Quick Check:

    Import word_tokenize before calling it [OK]
Hint: import nltk alone does not define the bare name word_tokenize [OK]
Common Mistakes:
  • Assuming import nltk makes word_tokenize available as a bare name
  • Trying to call word_tokenize without import
  • Passing list instead of string to tokenizer
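The last mistake above, passing a list instead of a string, can be caught early with a type check. The helper tokenize_checked below is hypothetical, and its regex is a rough stand-in for a real word tokenizer. It only sketches the idea of validating input before tokenizing.

```python
import re

def tokenize_checked(text):
    """Hypothetical helper: reject non-string input before tokenizing."""
    if not isinstance(text, str):
        raise TypeError(f"expected a str, got {type(text).__name__}")
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize_checked("Hello world!"))  # ['Hello', 'world', '!']

try:
    tokenize_checked(["Hello", "world!"])
except TypeError as exc:
    print(exc)  # expected a str, got list
```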
5. Given a paragraph with multiple sentences, how can you tokenize it into words while preserving sentence boundaries using NLTK?
hard
A. Use word_tokenize directly on the whole paragraph
B. Use sent_tokenize to split sentences, then word_tokenize each sentence separately
C. Use split() method on the paragraph string
D. Use sent_tokenize only, it also splits words

Solution

  1. Step 1: Understand the need to preserve sentence boundaries

    Preserving sentence boundaries means keeping words grouped by sentences.
  2. Step 2: Apply sent_tokenize then word_tokenize

    First split paragraph into sentences, then tokenize words in each sentence separately.
  3. Final Answer:

    Use sent_tokenize to split sentences, then word_tokenize each sentence separately -> Option B
  4. Quick Check:

    Split sentences first, then words [OK]
Hint: Split sentences first, then tokenize words inside each [OK]
Common Mistakes:
  • Tokenizing the whole paragraph directly, which loses sentence grouping
  • Using split(), which only splits on whitespace and leaves punctuation attached to words
  • Assuming sent_tokenize splits words
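The two-step approach can be sketched as a nested list, one inner list of word tokens per sentence. The regex helpers here are assumptions used so the example runs without NLTK. With NLTK available, the same shape comes from [word_tokenize(s) for s in sent_tokenize(paragraph)].

```python
import re

def tokenize_by_sentence(paragraph):
    """Illustrative helper: return one list of word tokens per sentence."""
    # Step 1: split into sentences (rough rule: whitespace after . ! or ?).
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    # Step 2: tokenize each sentence separately so the grouping is kept.
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences]

result = tokenize_by_sentence("Hello world! How are you?")
print(result)
# [['Hello', 'world', '!'], ['How', 'are', 'you', '?']]
```

Because each sentence has its own list, the sentence boundaries survive word tokenization.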