Tokenization and vocabulary in Prompt Engineering / GenAI - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is tokenization in natural language processing?
Tokenization is the process of breaking text into smaller pieces called tokens, such as words or subwords, to help machines understand and work with language.
beginner
Why do we need a vocabulary in language models?
A vocabulary is a list of all tokens the model knows. It helps the model convert text into numbers it can process and generate text by choosing tokens from this list.
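As a minimal sketch of this two-way role, the toy vocabulary below (hypothetical tokens and IDs, chosen only for illustration) converts tokens to numbers and numbers back to tokens. Real models use far larger vocabularies, but the lookup idea is the same.

```python
# Toy vocabulary: every token the "model" knows, mapped to an integer ID.
# The tokens and IDs here are illustrative assumptions.
vocab = {"I": 1, "love": 2, "AI": 3, ".": 4}

# Inverse mapping, used to turn IDs back into tokens.
id_to_token = {token_id: token for token, token_id in vocab.items()}


def encode(tokens):
    """Convert a list of tokens into their IDs."""
    return [vocab[t] for t in tokens]


def decode(ids):
    """Convert a list of IDs back into tokens."""
    return [id_to_token[i] for i in ids]


ids = encode(["I", "love", "AI", "."])
print(ids)          # [1, 2, 3, 4]
print(decode(ids))  # ['I', 'love', 'AI', '.']
```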
intermediate
What is the difference between word-level and subword-level tokenization?
Word-level tokenization splits text into whole words, while subword-level breaks words into smaller parts. Subword tokenization helps handle rare or new words better.
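The difference can be sketched with a greedy longest-match splitter. This is a simplified illustration, not the algorithm of any particular library; the set of pieces is a hypothetical hand-picked example, whereas real subword tokenizers learn their pieces from data.

```python
def word_tokenize(text):
    """Word-level: split on whitespace."""
    return text.split()


def subword_tokenize(word, pieces):
    """Subword-level: greedy longest-match split of one word into known pieces."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in pieces:
                tokens.append(word[start:end])
                start = end
                break
        else:
            # No known piece matches: fall back to a single character.
            tokens.append(word[start])
            start += 1
    return tokens


# Hypothetical set of known subword pieces.
pieces = {"token", "ization", "un", "break", "able"}

print(word_tokenize("unbreakable tokenization"))  # ['unbreakable', 'tokenization']
print(subword_tokenize("tokenization", pieces))   # ['token', 'ization']
print(subword_tokenize("unbreakable", pieces))    # ['un', 'break', 'able']
```

Neither full word needs its own vocabulary entry here: both are covered by pieces that other words can reuse.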
intermediate
How does tokenization affect the size of the vocabulary?
Finer-grained tokenization (such as subwords) needs a smaller vocabulary because the same word parts are reused across many words, while word-level tokenization needs a much larger vocabulary to cover every distinct word form.
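A rough way to see the effect is to count entries under both schemes for a small word list. The stem-plus-suffix split below is a hypothetical, hand-written scheme used only to make the counting visible; real subword vocabularies are learned from data.

```python
stems = ("play", "walk", "talk")
suffixes = ("", "s", "ed", "ing", "er")

# Every combination of stem and suffix: play, plays, played, ...
words = [stem + suffix for stem in stems for suffix in suffixes]

# Word-level: one vocabulary entry per distinct word.
word_vocab = set(words)

# Subword-level (illustrative): one entry per stem and one per non-empty suffix.
subword_vocab = set()
for word in words:
    for stem in stems:
        if word.startswith(stem):
            subword_vocab.add(stem)
            suffix = word[len(stem):]
            if suffix:
                subword_vocab.add(suffix)

print(len(word_vocab))     # 15
print(len(subword_vocab))  # 7
```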
beginner
What happens if a token is not in the vocabulary during model use?
If a token is missing, the model usually replaces it with a special unknown token or breaks it into smaller known tokens, so the text can still be processed.
What is the main goal of tokenization?
A. To remove punctuation from text
B. To translate text into another language
C. To compress text into fewer characters
D. To split text into smaller pieces called tokens
Which tokenization method helps handle new or rare words better?
A. Subword-level tokenization
B. Word-level tokenization
C. Character-level tokenization
D. Sentence-level tokenization
What does a vocabulary in a language model contain?
A. All possible sentences
B. All tokens the model can recognize
C. All grammar rules
D. All training data
What is a common way models handle tokens not in their vocabulary?
A. Ignore the token completely
B. Add the token to the vocabulary instantly
C. Replace with an unknown token or split into smaller tokens
D. Translate the token to another language
Which tokenization approach usually results in the largest vocabulary?
A. Word-level tokenization
B. Subword-level tokenization
C. Character-level tokenization
D. Sentence-level tokenization
Explain in your own words what tokenization is and why it is important for language models.
Think about how you might split a sentence into pieces to help a computer read it.
Describe the role of vocabulary in a language model and what happens when the model encounters a token not in its vocabulary.
Consider how a dictionary helps you understand words, and what you do if a word is missing.

      Practice

      1. What does tokenization do in natural language processing?
      easy
      A. Converts tokens into images
      B. Breaks text into smaller pieces called tokens
      C. Removes all punctuation from text
      D. Combines multiple texts into one

      Solution

      1. Step 1: Understand the role of tokenization

        Tokenization splits text into smaller parts called tokens, like words or subwords.
      2. Step 2: Compare options with tokenization definition

        Only option B, "Breaks text into smaller pieces called tokens", matches this definition.
      3. Final Answer:

        Breaks text into smaller pieces called tokens -> Option B
      4. Quick Check:

        Tokenization = splitting text [OK]
      Hint: Tokenization means splitting text into pieces
      Common Mistakes:
      • Thinking tokenization changes text to images
      • Confusing tokenization with removing punctuation
      • Believing tokenization merges texts
      2. Which of the following is the correct way to represent a token ID in Python?
      easy
      A. token_id = 'word'
      B. token_id = {word: 1}
      C. token_id = [word]
      D. token_id = 123

      Solution

      1. Step 1: Understand token ID representation

        Token IDs are numbers representing tokens, so they should be integers.
      2. Step 2: Check each option's type

        token_id = 123 assigns the integer 123, which is correct. The other options assign a string, a dictionary, or a list instead of a number.
      3. Final Answer:

        token_id = 123 -> Option D
      4. Quick Check:

        Token ID = number [OK]
      Hint: Token IDs are numbers, not words or lists
      Common Mistakes:
      • Using strings instead of numbers for token IDs
      • Confusing token IDs with token text
      • Using lists or dictionaries wrongly
      3. Given the vocabulary {'hello': 1, 'world': 2, '!': 3}, what is the token ID list for the text 'hello world!', assuming punctuation is split off as its own token?
      medium
      A. [1, 2, 3]
      B. [0, 1, 2]
      C. ['hello', 'world', '!']
      D. [3, 2, 1]

      Solution

      1. Step 1: Split the text into tokens and map each token to its ID

        Splitting off the punctuation gives the tokens 'hello', 'world', and '!', which map to 1, 2, and 3 according to the vocabulary.
      2. Step 2: Create the token ID list in order

        The text 'hello world!' becomes [1, 2, 3].
      3. Final Answer:

        [1, 2, 3] -> Option A
      4. Quick Check:

        Text tokens = [1, 2, 3] [OK]
      Hint: Match tokens to IDs in order
      Common Mistakes:
      • Mixing up token order
      • Using token text instead of IDs
      • Assigning wrong IDs from vocabulary
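The mapping in this question can be sketched in a few lines of Python. The regular expression is one simple, assumed way to separate punctuation from words; a plain `split()` would leave 'world!' as a single token that is not in the vocabulary.

```python
import re

vocab = {"hello": 1, "world": 2, "!": 3}


def tokenize(text):
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = tokenize("hello world!")
ids = [vocab[t] for t in tokens]

print(tokens)                  # ['hello', 'world', '!']
print(ids)                     # [1, 2, 3]
print("hello world!".split())  # ['hello', 'world!']
```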
      4. What is the main weakness of this tokenization code snippet when the text contains words that are not in the vocabulary?
      vocab = {'hi': 1, 'there': 2}
      text = 'hi there'
      tokens = [vocab[word] for word in text.split() if word in vocab]
      medium
      A. It will raise a KeyError if a word is missing
      B. It correctly tokenizes the text
      C. It ignores words not in vocabulary
      D. It uses split() incorrectly on the text

      Solution

      1. Step 1: Analyze the list comprehension

        The code splits text and includes only words found in vocab, skipping others.
      2. Step 2: Identify behavior on unknown words

        Words not in vocab are silently dropped, so the ID list no longer reflects the full text and information may be lost.
      3. Final Answer:

        It ignores words not in vocabulary -> Option C
      4. Quick Check:

        Unknown words skipped = ignoring [OK]
      Hint: Check if unknown words are skipped or cause errors
      Common Mistakes:
      • Assuming a KeyError will be raised, even though the 'if' check prevents it
      • Thinking split() is wrong here
      • Missing that unknown words are ignored silently
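One common alternative to dropping unknown words is to map them to a reserved unknown-token ID, so the output keeps one ID per input word. The sketch below compares both behaviors; `UNK_ID = 0` and the function names are assumptions made for illustration.

```python
UNK_ID = 0  # hypothetical ID reserved for unknown tokens
vocab = {"hi": 1, "there": 2}


def encode_dropping(text):
    """Behavior of the snippet above: unknown words vanish from the output."""
    return [vocab[word] for word in text.split() if word in vocab]


def encode_with_unk(text):
    """Alternative: keep a placeholder ID for every unknown word."""
    return [vocab.get(word, UNK_ID) for word in text.split()]


print(encode_dropping("hi there friend"))  # [1, 2]
print(encode_with_unk("hi there friend"))  # [1, 2, 0]
```

With the placeholder, the length of the ID list still matches the number of words, which makes the missing coverage visible instead of hiding it.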
      5. You have a vocabulary with tokens: {'I': 1, 'love': 2, 'AI': 3, '.': 4}, and you are still building it, so new entries can be added. How would you tokenize the sentence 'I love AI!' given that the exclamation mark is not yet in the vocabulary?
      hard
      A. Add '!' to vocabulary with new ID and tokenize as [1, 2, 3, 5]
      B. Replace '!' with '.' and tokenize as [1, 2, 3, 4]
      C. Ignore '!' and tokenize as [1, 2, 3]
      D. Raise an error because '!' is unknown

      Solution

      1. Step 1: Understand vocabulary coverage

        The vocabulary lacks '!'. Because the vocabulary is still being built, the missing token can be added so the sentence is represented in full.
      2. Step 2: Add '!' with a new token ID

        Assign '!' the next free ID (5) and tokenize the sentence as [1, 2, 3, 5]. Once a model has been trained on a fixed vocabulary, new entries cannot simply be added on the fly; an unknown token is then replaced with a special unknown token or split into smaller known tokens.
      3. Final Answer:

        Add '!' to vocabulary with new ID and tokenize as [1, 2, 3, 5] -> Option A
      4. Quick Check:

        Unknown token added = new ID [OK]
      Hint: While the vocabulary is still being built, give each new token the next free ID
      Common Mistakes:
      • Ignoring unknown tokens silently
      • Replacing an unknown token with a different token that changes the meaning
      • Raising an error instead of handling the unknown token
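Option A's approach can be sketched as below. The function name `encode_and_extend` is hypothetical, and the approach only makes sense while a vocabulary is still being built; a trained model's vocabulary is normally fixed, so unknown tokens are then handled with an unknown token or subword splitting instead.

```python
def encode_and_extend(tokens, vocab):
    """Encode tokens, giving each unseen token the next free ID."""
    ids = []
    for token in tokens:
        if token not in vocab:
            # Next free ID: one more than the largest ID currently in use.
            vocab[token] = max(vocab.values(), default=0) + 1
        ids.append(vocab[token])
    return ids


vocab = {"I": 1, "love": 2, "AI": 3, ".": 4}
ids = encode_and_extend(["I", "love", "AI", "!"], vocab)

print(ids)         # [1, 2, 3, 5]
print(vocab["!"])  # 5
```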