
Tokenization and vocabulary in Prompt Engineering / GenAI - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is tokenization in natural language processing?
Tokenization is the process of breaking text into smaller pieces called tokens, such as words or subwords, to help machines understand and work with language.
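A minimal sketch of the idea above. This toy tokenizer splits on word characters and punctuation; real models use trained subword tokenizers, so this is only an illustration:

```python
import re

def tokenize(text: str) -> list[str]:
    # \w+ matches runs of word characters; [^\w\s] matches a single
    # punctuation mark, so punctuation becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Tokenization breaks text into pieces!"))
# ['Tokenization', 'breaks', 'text', 'into', 'pieces', '!']
```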
beginner
Why do we need a vocabulary in language models?
A vocabulary is a list of all tokens the model knows. It helps the model convert text into numbers it can process and generate text by choosing tokens from this list.
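The token-to-number conversion can be sketched with a toy vocabulary. Real vocabularies are learned during training and contain tens of thousands of entries; the table and helper names below are made up for illustration:

```python
# Toy vocabulary: a fixed token-to-id table.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

def to_ids(tokens):
    # Text -> numbers: the form a model can actually consume.
    return [vocab[t] for t in tokens]

def to_tokens(ids):
    # Numbers -> text: how generated ids become readable output.
    id_to_token = {i: t for t, i in vocab.items()}
    return [id_to_token[i] for i in ids]

ids = to_ids(["the", "cat", "sat", "on", "the", "mat"])
print(ids)             # [0, 1, 2, 3, 0, 4]
print(to_tokens(ids))  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
```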
intermediate
What is the difference between word-level and subword-level tokenization?
Word-level tokenization splits text into whole words, while subword-level breaks words into smaller parts. Subword tokenization helps handle rare or new words better.
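The subword advantage can be sketched with a greedy longest-match tokenizer in the style of WordPiece. The tiny piece inventory here is invented for the example; a real tokenizer learns its pieces from a large corpus:

```python
# "##" marks a piece that continues a word, as in WordPiece.
subwords = {"token", "##ization", "un", "##seen"}

def subword_tokenize(word: str) -> list[str]:
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining match first.
        for end in range(len(word), start, -1):
            cand = word[start:end] if start == 0 else "##" + word[start:end]
            if cand in subwords:
                pieces.append(cand)
                start = end
                break
        else:
            return ["[UNK]"]  # no piece matches: give up on the whole word
    return pieces

print(subword_tokenize("tokenization"))  # ['token', '##ization']
print(subword_tokenize("unseen"))        # ['un', '##seen']
```

A word-level tokenizer would map both words to `[UNK]` if they were absent from its vocabulary; the subword tokenizer covers them from known pieces.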
intermediate
How does tokenization affect the size of the vocabulary?
Tokenization granularity trades off vocabulary size: character-level needs only a small, fixed symbol set, subword-level keeps the vocabulary moderate by reusing word pieces across many words, and word-level needs the largest vocabulary because every distinct word gets its own entry.
beginner
What happens if a token is not in the vocabulary during model use?
If a token is missing, the model usually replaces it with a special unknown token or breaks it into smaller known tokens to still understand it.
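The unknown-token fallback can be sketched in a few lines. The vocabulary and the `<unk>` marker below are illustrative; subword tokenizers would instead split the unknown word into known pieces:

```python
# Out-of-vocabulary handling: substitute a special <unk> id
# for any token missing from the vocabulary.
vocab = {"<unk>": 0, "the": 1, "model": 2, "reads": 3, "text": 4}

def encode(tokens):
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

print(encode(["the", "model", "reads", "hieroglyphs"]))  # [1, 2, 3, 0]
```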
What is the main goal of tokenization?
A. To remove punctuation from text
B. To translate text into another language
C. To compress text into fewer characters
D. To split text into smaller pieces called tokens
Which tokenization method helps handle new or rare words better?
A. Subword-level tokenization
B. Word-level tokenization
C. Character-level tokenization
D. Sentence-level tokenization
What does a vocabulary in a language model contain?
A. All possible sentences
B. All tokens the model can recognize
C. All grammar rules
D. All training data
What is a common way models handle tokens not in their vocabulary?
A. Ignore the token completely
B. Add the token to the vocabulary instantly
C. Replace with an unknown token or split into smaller tokens
D. Translate the token to another language
Which tokenization approach usually results in the largest vocabulary?
A. Word-level tokenization
B. Subword-level tokenization
C. Character-level tokenization
D. Sentence-level tokenization
Explain in your own words what tokenization is and why it is important for language models.
Think about how you might split a sentence into pieces to help a computer read it.
Describe the role of vocabulary in a language model and what happens when the model encounters a token not in its vocabulary.
Consider how a dictionary helps you understand words, and what you do if a word is missing.