Recall & Review
beginner
What is tokenization in natural language processing?
Tokenization is the process of breaking text into smaller pieces called tokens, such as words or subwords, to help machines understand and work with language.
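As a rough illustration, a very simple word-and-punctuation tokenizer might look like the sketch below. The function name `tokenize` and the regular expression are assumptions made for this example; real tokenizers are usually more elaborate.

```python
import re

def tokenize(text):
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```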
beginner
Why do we need a vocabulary in language models?
A vocabulary is a list of all tokens the model knows. It helps the model convert text into numbers it can process and generate text by choosing tokens from this list.
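The sketch below shows one possible way a vocabulary can turn tokens into numbers and back. The helper `build_vocab` and the ID assignment (order of first appearance) are assumptions for illustration; real models learn their vocabulary from a large corpus.

```python
def build_vocab(tokens):
    # Assign each new token the next free ID, in order of first appearance.
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = build_vocab(tokens)

# Encode: text tokens -> numbers.
ids = [vocab[t] for t in tokens]

# Decode: numbers -> text tokens, using the reversed mapping.
id_to_token = {i: t for t, i in vocab.items()}
decoded = [id_to_token[i] for i in ids]

print(vocab)  # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(ids)    # [0, 1, 2, 3, 0, 4]
```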
intermediate
What is the difference between word-level and subword-level tokenization?
Word-level tokenization splits text into whole words, while subword-level breaks words into smaller parts. Subword tokenization helps handle rare or new words better.
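To make the difference concrete, here is a minimal sketch of subword splitting using a simplified greedy longest-match rule. The function `subword_tokenize`, the hand-picked piece set, and the `<unk>` marker are assumptions for this example; real subword tokenizers such as BPE or WordPiece learn their pieces from data.

```python
def subword_tokenize(word, subwords):
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate from the right until it is a known piece.
        while end > start and word[start:end] not in subwords:
            end -= 1
        if end == start:
            # No known piece starts at this position.
            return ["<unk>"]
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical, hand-picked subword pieces.
subwords = {"token", "ization", "un", "break", "able", "s"}

print(subword_tokenize("tokenization", subwords))  # ['token', 'ization']
print(subword_tokenize("unbreakable", subwords))   # ['un', 'break', 'able']
```

A word-level tokenizer with the same six entries would not recognize either word, while the subword version covers both by combining known pieces.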
intermediate
How does tokenization affect the size of the vocabulary?
Finer-grained tokenization (such as subwords) needs a smaller vocabulary because the same word pieces are reused across many words, while word-level tokenization needs a separate entry for every distinct word form.
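The effect on vocabulary size can be seen on a toy word list. In this sketch the suffix list and the helper `split_suffix` are hand-picked assumptions, not how a real tokenizer chooses its pieces.

```python
words = ["play", "played", "playing", "plays",
         "walk", "walked", "walking", "walks",
         "jump", "jumped", "jumping", "jumps"]

# Hypothetical, hand-picked suffixes used as subword pieces.
suffixes = ("ing", "ed", "s")

def split_suffix(word):
    # Split off the first matching suffix, if the word has one.
    for suf in suffixes:
        if word.endswith(suf) and len(word) > len(suf):
            return [word[:-len(suf)], suf]
    return [word]

word_vocab = set(words)
subword_vocab = {piece for w in words for piece in split_suffix(w)}

print(len(word_vocab))     # 12
print(len(subword_vocab))  # 6
```

Twelve word forms need twelve word-level entries, but only six subword entries: three stems and three suffixes.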
beginner
What happens if a token is not in the vocabulary during model use?
If a token is missing, the tokenizer usually replaces it with a special unknown token or breaks it into smaller known tokens, so the model can still process the input.
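Both strategies can be combined, as in the sketch below: try the whole token first, then fall back to single characters, and use the unknown ID as a last resort. The name `encode_with_fallback`, the `UNK_ID` value, and the toy vocabulary are assumptions for illustration.

```python
UNK_ID = 0  # hypothetical ID reserved for the unknown token

def encode_with_fallback(tokens, vocab):
    ids = []
    for tok in tokens:
        if tok in vocab:
            ids.append(vocab[tok])
        else:
            # Break the token into characters; any character that is
            # still unknown becomes UNK_ID.
            ids.extend(vocab.get(ch, UNK_ID) for ch in tok)
    return ids

vocab = {"<unk>": 0, "cat": 1, "s": 2, "a": 3, "t": 4}
print(encode_with_fallback(["cat", "sat", "z"], vocab))  # [1, 2, 3, 4, 0]
```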
What is the main goal of tokenization?
A. To remove punctuation from text
B. To translate text into another language
C. To compress text into fewer characters
D. To split text into smaller pieces called tokens
Tokenization breaks text into tokens so machines can process language.
Which tokenization method helps handle new or rare words better?
A. Subword-level tokenization
B. Word-level tokenization
C. Character-level tokenization
D. Sentence-level tokenization
Subword tokenization breaks words into smaller parts, helping with rare or new words.
What does a vocabulary in a language model contain?
A. All possible sentences
B. All tokens the model can recognize
C. All grammar rules
D. All training data
The vocabulary lists all tokens the model knows, which lets it convert text into numbers.
What is a common way models handle tokens not in their vocabulary?
A. Ignore the token completely
B. Add the token to the vocabulary instantly
C. Replace with an unknown token or split into smaller tokens
D. Translate the token to another language
Models replace out-of-vocabulary tokens with a special unknown token, or break them into smaller known tokens.
Which tokenization approach usually results in the largest vocabulary?
A. Word-level tokenization
B. Subword-level tokenization
C. Character-level tokenization
D. Sentence-level tokenization
Word-level tokenization needs a large vocabulary to cover all words.
Explain in your own words what tokenization is and why it is important for language models.
Think about how you might split a sentence into pieces to help a computer read it.
Describe the role of vocabulary in a language model and what happens when the model encounters a token not in its vocabulary.
Consider how a dictionary helps you understand words, and what you do if a word is missing.
Practice
1. What does tokenization do in natural language processing?
easy
A. Converts tokens into images
B. Breaks text into smaller pieces called tokens
C. Removes all punctuation from text
D. Combines multiple texts into one
Solution
Step 1: Understand the role of tokenization
Tokenization splits text into smaller parts called tokens, like words or subwords.
Step 2: Compare options with tokenization definition
Only option B, "Breaks text into smaller pieces called tokens", matches this definition.
Final Answer:
Breaks text into smaller pieces called tokens -> Option B
Quick Check:
Tokenization = splitting text [OK]
Hint: Tokenization means splitting text into pieces [OK]
Common Mistakes:
Thinking tokenization changes text to images
Confusing tokenization with removing punctuation
Believing tokenization merges texts
2. Which of the following is the correct way to represent a token ID in Python?
easy
A. token_id = 'word'
B. token_id = {word: 1}
C. token_id = [word]
D. token_id = 123
Solution
Step 1: Understand token ID representation
Token IDs are numbers representing tokens, so they should be integers.
Step 2: Check each option's type
token_id = 123 assigns the integer 123, which is correct. The other options assign a string, a dictionary, or a list instead of a number.
Final Answer:
token_id = 123 -> Option D
Quick Check:
Token ID = number [OK]
Hint: Token IDs are numbers, not words or lists [OK]
Common Mistakes:
Using strings instead of numbers for token IDs
Confusing token IDs with token text
Using lists or dictionaries wrongly
3. Given the vocabulary {'hello': 1, 'world': 2, '!': 3}, what is the token ID list for the text 'hello world!'?
medium
A. [1, 2, 3]
B. [0, 1, 2]
C. ['hello', 'world', '!']
D. [3, 2, 1]
Solution
Step 1: Split the text and map each token to its ID
The text splits into 'hello', 'world', and '!', which map to 1, 2, and 3 according to the vocabulary.
Step 2: Create the token ID list in order
The text 'hello world!' becomes [1, 2, 3].
Final Answer:
[1, 2, 3] -> Option A
Quick Check:
Text tokens = [1, 2, 3] [OK]
Hint: Match words to IDs in order [OK]
Common Mistakes:
Mixing up token order
Using token text instead of IDs
Assigning wrong IDs from vocabulary
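The solution above can be checked with a short sketch. It assumes punctuation is split off as its own token (here with a regular expression chosen for this example); a plain text.split() would produce 'world!', which is not in the vocabulary.

```python
import re

vocab = {'hello': 1, 'world': 2, '!': 3}
text = 'hello world!'

# Split into words and single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", text)

# Look up each token's ID, keeping the original order.
ids = [vocab[tok] for tok in tokens]

print(tokens)  # ['hello', 'world', '!']
print(ids)     # [1, 2, 3]
```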
4. What is wrong with this tokenization code snippet?
vocab = {'hi': 1, 'there': 2}
text = 'hi there'
tokens = [vocab[word] for word in text.split() if word in vocab]
medium
A. It will raise a KeyError if a word is missing
B. It correctly tokenizes the text
C. It ignores words not in vocabulary
D. It uses split() incorrectly on the text
Solution
Step 1: Analyze the list comprehension
The code splits text and includes only words found in vocab, skipping others.
Step 2: Identify behavior on unknown words
Words not in vocab are ignored, which may lose information.
Final Answer:
It ignores words not in vocabulary -> Option C
Quick Check:
Unknown words skipped = ignoring [OK]
Hint: Check if unknown words are skipped or cause errors [OK]
Common Mistakes:
Assuming a KeyError will be raised, even though the 'if' check prevents it
Thinking split() is wrong here
Missing that unknown words are ignored silently
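The silent skipping is easier to see with an input that contains an unknown word. The sketch below uses a modified text and a hypothetical `UNK_ID` of 0 to compare the snippet's behavior with one possible alternative that keeps a placeholder for each unknown word.

```python
vocab = {'hi': 1, 'there': 2}
text = 'hi there friend'  # 'friend' is not in the vocabulary

# Same comprehension as in the question: unknown words vanish.
skipped = [vocab[word] for word in text.split() if word in vocab]
print(skipped)  # [1, 2]

# One possible alternative: keep every position, using a placeholder ID.
UNK_ID = 0  # hypothetical ID for unknown words
kept = [vocab.get(word, UNK_ID) for word in text.split()]
print(kept)  # [1, 2, 0]
```

The first list has two IDs for three words, so the output no longer lines up with the input; the second keeps one ID per word.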
5. You have a vocabulary with tokens: {'I':1, 'love':2, 'AI':3, '.':4}. How would you tokenize the sentence 'I love AI!' considering the exclamation mark is not in the vocabulary?
hard
A. Add '!' to vocabulary with new ID and tokenize as [1, 2, 3, 5]
B. Replace '!' with '.' and tokenize as [1, 2, 3, 4]
C. Ignore '!' and tokenize as [1, 2, 3]
D. Raise an error because '!' is unknown
Solution
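Step 1: Understand vocabulary coverage
The vocabulary lacks '!'. Of the listed options, only adding it keeps every token of the sentence: replacing '!' with '.' changes the text, ignoring it loses information, and raising an error produces no tokens at all.
Step 2: Add '!' with a new token ID
Assign '!' the next free ID (e.g., 5) and tokenize the sentence as [1, 2, 3, 5]. This works only while the vocabulary is still being built. Once a model is trained, its vocabulary is fixed, and an unseen token is instead replaced with an unknown token or split into smaller known tokens.
Final Answer:
Add '!' to vocabulary with new ID and tokenize as [1, 2, 3, 5] -> Option A
Quick Check:
Missing token added = new ID [OK]
Hint: Add missing tokens while building the vocabulary, before tokenizing [OK]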
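A minimal sketch of this answer follows. The function `encode_and_extend` and the punctuation-splitting regular expression are assumptions for illustration, and the approach only makes sense while the vocabulary is still being built, not for a model that has already been trained.

```python
import re

vocab = {'I': 1, 'love': 2, 'AI': 3, '.': 4}

def encode_and_extend(text, vocab):
    ids = []
    for tok in re.findall(r"\w+|[^\w\s]", text):
        if tok not in vocab:
            # Give the new token the next free ID.
            vocab[tok] = max(vocab.values()) + 1
        ids.append(vocab[tok])
    return ids

print(encode_and_extend('I love AI!', vocab))  # [1, 2, 3, 5]
print(vocab['!'])                              # 5
```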