
Tokenization and vocabulary in Prompt Engineering / GenAI - Model Metrics & Evaluation

Metrics & Evaluation - Tokenization and vocabulary
Which metric matters for Tokenization and vocabulary and WHY

For tokenization and vocabulary, the key metrics are token coverage and out-of-vocabulary (OOV) rate. Token coverage is the share of input tokens (words or subwords) that appear in the vocabulary, so high coverage means the model recognizes most of the text. The OOV rate is the share of tokens that are not in the vocabulary; the two always sum to 100%. A high OOV rate makes it harder for the model to understand or generate text. These metrics matter because good tokenization breaks text into meaningful pieces the model already knows, which helps it learn and predict.

Confusion matrix or equivalent visualization
Vocabulary Size: 10,000 tokens

Example:
Total words in text: 1000
Known tokens (in vocabulary): 950
Unknown tokens (OOV): 50

Token Coverage = 950 / 1000 = 95%
OOV Rate = 50 / 1000 = 5%

This simple count shows what share of the text maps to known vocabulary entries and what share falls outside the vocabulary.
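
The same arithmetic can be sketched in Python. This is a minimal illustration, not a standard library routine: the function name coverage_and_oov, the three-word vocabulary, and the synthetic token list are all assumptions chosen to reproduce the 950 / 50 split above.

```python
def coverage_and_oov(tokens, vocab):
    """Return (coverage, oov_rate) as fractions of the total token count."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    known = sum(1 for tok in tokens if tok in vocab)
    unknown = len(tokens) - known
    return known / len(tokens), unknown / len(tokens)


# Synthetic data (made up) mirroring the example: 950 known, 50 unknown.
vocab = {"the", "cat", "sat"}
tokens = ["the"] * 500 + ["cat"] * 300 + ["sat"] * 150 + ["zyzzyva"] * 50

coverage, oov_rate = coverage_and_oov(tokens, vocab)
print(f"Token coverage: {coverage:.0%}")
print(f"OOV rate: {oov_rate:.0%}")
```

Both values are computed from the same count, so they always add up to 1.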
Precision vs Recall tradeoff with concrete examples

In tokenization, think of precision as how accurately tokens represent real words or meaningful parts, and recall as how many real words are captured by the vocabulary.

Example 1: High precision, low recall
Vocabulary has very specific tokens, so each token is very meaningful (high precision). But many words are missing, so many tokens are unknown (low recall). This can confuse the model on new text.

Example 2: High recall, low precision
Vocabulary includes many tokens, even rare or noisy ones. Most words are covered (high recall), but some tokens are too small or meaningless (low precision). This can make the model slower and less clear.

The goal is to balance token coverage (recall) and meaningful tokens (precision) for best model understanding.

What "good" vs "bad" metric values look like for this use case
  • Good: Token coverage above 95%, OOV rate below 5%. Vocabulary size balanced to cover most words without too many rare tokens.
  • Bad: Token coverage below 80%, OOV rate above 20%. Many unknown tokens cause poor model understanding and errors.
  • A vocabulary that is too large can slow training and increase memory use without much gain in coverage.
  • A vocabulary that is too small leads to many unknown tokens and poor text representation.
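
The size-versus-coverage tradeoff in the list above can be sketched by keeping only the most frequent tokens and measuring coverage at each size. The helper coverage_at_size and the toy corpus are hypothetical, chosen only to make the diminishing returns visible.

```python
from collections import Counter


def coverage_at_size(tokens, vocab_size):
    """Coverage when the vocabulary keeps only the most frequent tokens."""
    counts = Counter(tokens)
    kept = sum(count for _, count in counts.most_common(vocab_size))
    return kept / len(tokens)


# Toy corpus (made up): three frequent words plus ten words seen once each.
corpus = (
    ["the"] * 50
    + ["model"] * 30
    + ["token"] * 10
    + [f"rare{i}" for i in range(10)]
)

for size in (1, 2, 3, 13):
    print(size, f"{coverage_at_size(corpus, size):.0%}")
```

In this toy corpus the first three entries reach 90% coverage, while the last ten entries together add only the remaining 10%.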
Metrics pitfalls
  • Ignoring OOV rate: High accuracy on training data can hide many unknown tokens in new text, causing poor real-world performance.
  • Overfitting vocabulary: Vocabulary too tuned to training data may not generalize to new words or languages.
  • Data leakage: Including test data words in vocabulary inflates coverage and misleads evaluation.
  • Ignoring token granularity: Very small tokens (like single letters) increase coverage but reduce meaningfulness.
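
The data leakage pitfall can be illustrated with a small sketch. It assumes whitespace tokenization and lowercase text; the helper names and the example sentences are made up for illustration.

```python
def build_vocab(texts):
    """Collect every whitespace-separated, lowercased word in the texts."""
    vocab = set()
    for text in texts:
        vocab.update(text.lower().split())
    return vocab


def coverage(texts, vocab):
    """Share of tokens in the texts that appear in the vocabulary."""
    tokens = [tok for text in texts for tok in text.lower().split()]
    return sum(tok in vocab for tok in tokens) / len(tokens)


train_texts = ["the cat sat", "the dog ran"]
test_texts = ["the bird sat", "a dog flew"]

honest_vocab = build_vocab(train_texts)              # training data only
leaky_vocab = build_vocab(train_texts + test_texts)  # test words leak in

print(coverage(test_texts, honest_vocab))
print(coverage(test_texts, leaky_vocab))
```

The leaky vocabulary reports perfect coverage on the test texts only because it was built from them, so the number says nothing about unseen text.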
Self-check question

Your tokenizer has 98% token coverage but a vocabulary size of 100,000 tokens. Is this good? Why or why not?

Answer: Not necessarily. 98% coverage is high, but a vocabulary of 100,000 tokens is large: the embedding table grows with vocabulary size, which increases memory use and can slow training. The vocabulary may also contain many rare tokens that add little. A smaller vocabulary with slightly lower coverage (e.g., 95%) could be more efficient and still effective, so whether this setup is a good choice depends on the memory and speed budget.
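
To make the memory cost concrete, here is a rough sketch of how an embedding table scales with vocabulary size. The embedding dimension of 768, the comparison size of 30,000 tokens, and float32 storage are assumptions for illustration and do not come from the question.

```python
def embedding_parameters(vocab_size, embedding_dim):
    """Number of weights in an embedding table: one vector per token."""
    return vocab_size * embedding_dim


EMBEDDING_DIM = 768      # hypothetical model width
BYTES_PER_FLOAT32 = 4    # assumes float32 weights

for vocab_size in (30_000, 100_000):
    params = embedding_parameters(vocab_size, EMBEDDING_DIM)
    megabytes = params * BYTES_PER_FLOAT32 / 1e6
    print(f"{vocab_size:>7} tokens: {params:,} weights, {megabytes:.1f} MB")
```

Under these assumptions the larger vocabulary needs more than three times the embedding memory of the smaller one.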

Key Result
Token coverage and out-of-vocabulary rate are key metrics to evaluate tokenization quality and vocabulary effectiveness.

Practice

1. What does tokenization do in natural language processing?
easy
A. Converts tokens into images
B. Breaks text into smaller pieces called tokens
C. Removes all punctuation from text
D. Combines multiple texts into one

Solution

  1. Step 1: Understand the role of tokenization

    Tokenization splits text into smaller parts called tokens, like words or subwords.
  2. Step 2: Compare options with tokenization definition

    Only option B, "Breaks text into smaller pieces called tokens", matches this definition.
  3. Final Answer:

    Breaks text into smaller pieces called tokens -> Option B
  4. Quick Check:

    Tokenization = splitting text [OK]
Hint: Tokenization means splitting text into pieces [OK]
Common Mistakes:
  • Thinking tokenization changes text to images
  • Confusing tokenization with removing punctuation
  • Believing tokenization merges texts
2. Which of the following is the correct way to represent a token ID in Python?
easy
A. token_id = 'word'
B. token_id = {word: 1}
C. token_id = [word]
D. token_id = 123

Solution

  1. Step 1: Understand token ID representation

    Token IDs are numbers representing tokens, so they should be integers.
  2. Step 2: Check each option's type

    token_id = 123 assigns an integer 123, which is correct. Others use strings, lists, or dictionaries incorrectly.
  3. Final Answer:

    token_id = 123 -> Option D
  4. Quick Check:

    Token ID = number [OK]
Hint: Token IDs are numbers, not words or lists [OK]
Common Mistakes:
  • Using strings instead of numbers for token IDs
  • Confusing token IDs with token text
  • Using lists or dictionaries wrongly
3. Given the vocabulary {'hello': 1, 'world': 2, '!': 3}, what is the token ID list for the text 'hello world!'?
medium
A. [1, 2, 3]
B. [0, 1, 2]
C. ['hello', 'world', '!']
D. [3, 2, 1]

Solution

  1. Step 1: Map each word to its token ID

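    'hello' maps to 1, 'world' maps to 2, and '!' maps to 3 according to the vocabulary. This assumes the tokenizer separates the trailing '!' from 'world'; a plain whitespace split would produce 'world!', which is not in the vocabulary.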
  2. Step 2: Create the token ID list in order

    The text 'hello world!' becomes [1, 2, 3].
  3. Final Answer:

    [1, 2, 3] -> Option A
  4. Quick Check:

    Text tokens = [1, 2, 3] [OK]
Hint: Match words to IDs in order [OK]
Common Mistakes:
  • Mixing up token order
  • Using token text instead of IDs
  • Assigning wrong IDs from vocabulary
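
As a minimal sketch of question 3, the snippet below splits punctuation from words with a regular expression and then looks up each token. The regex rule and the function names tokenize and encode are assumptions; real tokenizers use more elaborate rules.

```python
import re

vocab = {"hello": 1, "world": 2, "!": 3}


def tokenize(text):
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


def encode(text, vocab):
    # Raises KeyError for tokens missing from the vocabulary.
    return [vocab[tok] for tok in tokenize(text)]


print(tokenize("hello world!"))
print(encode("hello world!", vocab))
print("hello world!".split())  # whitespace split keeps 'world!' as one piece
```

The last line shows why the splitting rule matters: with a plain whitespace split, 'world!' would not match any vocabulary entry.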
4. What is wrong with this tokenization code snippet?
vocab = {'hi': 1, 'there': 2}
text = 'hi there'
tokens = [vocab[word] for word in text.split() if word in vocab]
medium
A. It will raise a KeyError if a word is missing
B. It correctly tokenizes the text
C. It ignores words not in vocabulary
D. It uses split() incorrectly on the text

Solution

  1. Step 1: Analyze the list comprehension

    The code splits text and includes only words found in vocab, skipping others.
  2. Step 2: Identify behavior on unknown words

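    For 'hi there' both words are in vocab, so the result is [1, 2]. For any text that contains an unknown word, that word is dropped silently, which loses information.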
  3. Final Answer:

    It ignores words not in vocabulary -> Option C
  4. Quick Check:

    Unknown words skipped = ignoring [OK]
Hint: Check if unknown words are skipped or cause errors [OK]
Common Mistakes:
  • Assuming a KeyError will occur, even though the 'if' check prevents it
  • Thinking split() is wrong here
  • Missing that unknown words are ignored silently
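
One common alternative to dropping unknown words silently is to map them to a reserved unknown token. The sketch below contrasts the two behaviors; the '<unk>' entry with ID 0 and both function names are assumptions added for illustration and are not part of the snippet in question 4.

```python
# Vocabulary from question 4 plus a hypothetical '<unk>' entry.
vocab = {"<unk>": 0, "hi": 1, "there": 2}


def encode_dropping(text, vocab):
    """Behavior of the original snippet: unknown words vanish."""
    return [vocab[word] for word in text.split() if word in vocab]


def encode_with_unk(text, vocab, unk_token="<unk>"):
    """Unknown words are kept as the unknown-token ID."""
    unk_id = vocab[unk_token]
    return [vocab.get(word, unk_id) for word in text.split()]


text = "hi there friend"
print(encode_dropping(text, vocab))
print(encode_with_unk(text, vocab))
```

With the unknown token, the output keeps one ID per input word, so the position of the missing word is preserved.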
5. You have a vocabulary with tokens: {'I':1, 'love':2, 'AI':3, '.':4}. How would you tokenize the sentence 'I love AI!' considering the exclamation mark is not in the vocabulary?
hard
A. Add '!' to vocabulary with new ID and tokenize as [1, 2, 3, 5]
B. Replace '!' with '.' and tokenize as [1, 2, 3, 4]
C. Ignore '!' and tokenize as [1, 2, 3]
D. Raise an error because '!' is unknown

Solution

  1. Step 1: Understand vocabulary coverage

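    The vocabulary lacks '!', so among the listed options the only one that represents the whole sentence is to add it. This works while the vocabulary is still being built; once a model is trained its vocabulary is fixed, and unknown symbols are usually mapped to a reserved unknown token or split into smaller known pieces instead.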
  2. Step 2: Add '!' with a new token ID

    Assign '!' a new ID (e.g., 5) and tokenize the sentence as [1, 2, 3, 5].
  3. Final Answer:

    Add '!' to vocabulary with new ID and tokenize as [1, 2, 3, 5] -> Option A
  4. Quick Check:

    Unknown token added = new ID [OK]
Hint: Add unknown tokens to vocabulary before tokenizing [OK]
Common Mistakes:
  • Ignoring unknown tokens silently
  • Replacing unknown tokens incorrectly
  • Assuming error without handling unknown tokens
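
The approach in question 5 can be sketched as follows, assuming the vocabulary is still being built and may be extended. The helper add_missing_tokens and the regex splitting rule are hypothetical.

```python
import re

vocab = {"I": 1, "love": 2, "AI": 3, ".": 4}


def add_missing_tokens(tokens, vocab):
    """Return a copy of vocab in which unseen tokens get the next free IDs."""
    extended = dict(vocab)
    next_id = max(extended.values(), default=0) + 1
    for tok in tokens:
        if tok not in extended:
            extended[tok] = next_id
            next_id += 1
    return extended


# Split words from punctuation so '!' becomes its own token.
tokens = re.findall(r"\w+|[^\w\s]", "I love AI!")
extended = add_missing_tokens(tokens, vocab)
ids = [extended[tok] for tok in tokens]

print(tokens)
print(ids)
```

The original vocab is left unchanged; only the copy gains the new entry for '!'.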