Recall & Review
beginner
What is tokenization in natural language processing?
Tokenization is the process of breaking text into smaller pieces called tokens, such as words or subwords, to help machines understand and work with language.
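As a concrete illustration, here is a minimal sketch of tokenization that splits on word characters and punctuation. The splitting rule is deliberately simple; real tokenizers use more sophisticated, learned rules.

```python
import re

def simple_tokenize(text):
    # Match runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Tokenization breaks text into tokens!")
print(tokens)  # ['Tokenization', 'breaks', 'text', 'into', 'tokens', '!']
```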
beginner
Why do we need a vocabulary in language models?
A vocabulary is a list of all tokens the model knows. It helps the model convert text into numbers it can process and generate text by choosing tokens from this list.
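A toy example of how a vocabulary turns tokens into numbers. The vocabulary contents and the `<unk>` entry here are made up for illustration.

```python
# Hypothetical vocabulary: token -> integer id, with <unk> as a fallback.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}

def encode(tokens, vocab):
    # Tokens not in the vocabulary map to the <unk> id.
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

def decode(ids, vocab):
    inverse = {i: t for t, i in vocab.items()}
    return [inverse[i] for i in ids]

print(encode(["the", "cat", "sat"], vocab))  # [1, 2, 3]
print(decode([1, 2, 3], vocab))              # ['the', 'cat', 'sat']
```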
intermediate
What is the difference between word-level and subword-level tokenization?
Word-level tokenization splits text into whole words, while subword-level tokenization breaks words into smaller reusable parts. Subword tokenization handles rare or unseen words better because they can be assembled from known pieces.
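A greedy longest-match subword splitter, loosely in the spirit of WordPiece, sketches the idea. The subword inventory below is invented for illustration, not taken from any real tokenizer.

```python
# Hypothetical subword inventory.
subwords = {"token", "ization", "un", "happi", "ness"}

def subword_split(word, subwords):
    pieces, start = [], 0
    while start < len(word):
        # Try the longest matching subword first.
        for end in range(len(word), start, -1):
            if word[start:end] in subwords:
                pieces.append(word[start:end])
                start = end
                break
        else:
            return None  # no segmentation found
    return pieces

print(subword_split("tokenization", subwords))  # ['token', 'ization']
print(subword_split("unhappiness", subwords))   # ['un', 'happi', 'ness']
```

Note how two words the inventory has never stored whole are still covered by a handful of shared pieces.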
intermediate
How does tokenization affect the size of the vocabulary?
Subword tokenization allows a smaller, fixed-size vocabulary because the same pieces are reused across many words, while word-level tokenization needs a separate entry for every distinct word, so its vocabulary grows much larger.
beginner
What happens if a token is not in the vocabulary during model use?
If a token is missing, the model usually replaces it with a special unknown token or breaks it into smaller known tokens to still understand it.
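The two fallback strategies can be sketched together: look the token up whole, and if it is missing, back off to characters, mapping anything still unknown to a special `<unk>` token. The vocabulary here is hypothetical.

```python
# Hypothetical vocabulary containing some words and some single characters.
vocab = {"hello", "world", "h", "e", "l", "o", "w", "r", "d", "x"}

def lookup(token, vocab):
    if token in vocab:
        return [token]
    # Back off to characters; anything still unknown becomes <unk>.
    return [c if c in vocab else "<unk>" for c in token]

print(lookup("hello", vocab))  # ['hello']
print(lookup("helix", vocab))  # ['h', 'e', 'l', '<unk>', 'x']
```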
What is the main goal of tokenization?
Tokenization breaks text into tokens so machines can process language.
Which tokenization method helps handle new or rare words better?
Subword tokenization breaks words into smaller parts, helping with rare or new words.
What does a vocabulary in a language model contain?
The vocabulary lists all tokens the model knows; it is used to convert text into numbers the model can process.
What is a common way models handle tokens not in their vocabulary?
Models replace unknown tokens with a special unknown token or break them into smaller known tokens.
Which tokenization approach usually results in the largest vocabulary?
Word-level tokenization needs a large vocabulary to cover all words.
Explain in your own words what tokenization is and why it is important for language models.
Think about how you might split a sentence into pieces to help a computer read it.
Describe the role of vocabulary in a language model and what happens when the model encounters a token not in its vocabulary.
Consider how a dictionary helps you understand words, and what you do if a word is missing.