
Tokenization and vocabulary in Prompt Engineering / GenAI - Deep Dive

Overview - Tokenization and vocabulary
What is it?
Tokenization is the process of breaking text into smaller pieces called tokens, which can be words, parts of words, or characters. Vocabulary is the collection of all unique tokens that a model knows and uses to understand and generate language. Together, tokenization and vocabulary help machines read and work with human language by turning sentences into manageable pieces.
Why it matters
Without tokenization and vocabulary, machines would see text as long, confusing strings of letters with no clear meaning. This would make it impossible for AI to understand, learn from, or generate language effectively. Tokenization and vocabulary let AI models handle language in a structured way, enabling everything from chatbots to translation tools to work well.
Where it fits
Before learning tokenization and vocabulary, you should understand basic text data and how computers represent information. After this, you can learn about embedding tokens into numbers and how models use these embeddings to learn language patterns.
Mental Model
Core Idea
Tokenization breaks text into known pieces, and vocabulary is the set of those pieces that a model understands and uses to read and write language.
Think of it like...
Tokenization and vocabulary are like cutting a long sentence into puzzle pieces and having a box of known pieces to build or understand the picture. If a piece is missing from the box, the puzzle can't be completed properly.
Text input
  │
  ▼
┌─────────────┐
│ Tokenizer   │  -- splits text into tokens
└─────────────┘
      │
      ▼
┌─────────────┐
│ Tokens      │  -- pieces like words or subwords
└─────────────┘
      │
      ▼
┌─────────────┐
│ Vocabulary  │  -- known tokens the model uses
└─────────────┘
Build-Up - 7 Steps
1
Foundation: What is Tokenization in Text
🤔
Concept: Tokenization means splitting text into smaller parts called tokens.
Imagine a sentence: "I love cats." Tokenization breaks it into tokens like ['I', 'love', 'cats', '.']. These tokens are easier for a computer to handle than the whole sentence at once.
Result
The sentence is now a list of tokens: ['I', 'love', 'cats', '.']
Understanding tokenization is key because it turns raw text into pieces that machines can process step-by-step.
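As a minimal sketch of this step, the snippet below splits the example sentence with a regular expression. The function name simple_tokenize and the splitting rule (runs of word characters, or single punctuation marks) are assumptions chosen for illustration, not a rule any particular model uses.

```python
import re

def simple_tokenize(text: str) -> list[str]:
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("I love cats.")
print(tokens)  # ['I', 'love', 'cats', '.']
```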
2
Foundation: Understanding Vocabulary in Models
🤔
Concept: Vocabulary is the set of all tokens a model knows and uses.
If a model's vocabulary includes ['I', 'love', 'cats', '.'], it can understand and generate these tokens. Tokens not in the vocabulary have no ID, so the model cannot represent them directly.
Result
The model can recognize and work with tokens only if they are in its vocabulary.
Knowing vocabulary limits helps explain why some words or symbols confuse AI models.
3
Intermediate: Different Tokenization Methods
🤔Before reading on: do you think tokenization always splits text by spaces, or can it be more complex? Commit to your answer.
Concept: Tokenization can split text by words, subwords, or characters depending on the method.
Word-level tokenization splits by spaces, e.g., 'cats' is one token. Subword tokenization breaks words into smaller parts like 'cat' + 's'. Character-level tokenization splits every letter as a token.
Result
Different tokenization methods produce different token lists for the same text.
Understanding tokenization types helps choose the right method for balancing vocabulary size and model flexibility.
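To make the contrast concrete, here is a small sketch that tokenizes the same text at word level and at character level. The sample text is an arbitrary choice for illustration.

```python
text = "cats sleep"

word_tokens = text.split()   # split on whitespace
char_tokens = list(text)     # every character, including the space

print(word_tokens)                          # ['cats', 'sleep']
print(char_tokens)
print(len(word_tokens), len(char_tokens))   # 2 10
```

The character-level list is five times longer here, which hints at the trade-off between vocabulary size and sequence length.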
4
Intermediate: Why Subword Tokenization is Popular
🤔Before reading on: do you think using whole words or subwords leads to smaller vocabulary sizes? Commit to your answer.
Concept: Subword tokenization reduces vocabulary size and handles unknown words better.
Subword methods like Byte Pair Encoding (BPE) break rare words into common parts. For example, 'unhappiness' might be split into 'un', 'happi', 'ness'. This lets models understand new words by combining known pieces.
Result
Models can handle new or rare words without needing huge vocabularies.
Knowing why subword tokenization works explains how models generalize to words they never saw during training.
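The idea of building a word from known pieces can be sketched with a greedy longest-match segmenter. Note that this is not BPE itself: it is a simpler stand-in that only shows how a fixed set of subword pieces can cover an unseen word. The vocabulary subword_vocab and the function segment are hypothetical.

```python
def segment(word: str, vocab: set[str]) -> list[str]:
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

# Hypothetical subword vocabulary, chosen by hand for this example.
subword_vocab = {"un", "happi", "ness", "cat", "s"}

print(segment("unhappiness", subword_vocab))  # ['un', 'happi', 'ness']
print(segment("cats", subword_vocab))         # ['cat', 's']
```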
5
Intermediate: Building and Using a Vocabulary
🤔
Concept: Vocabulary is built from training data tokens and used to map tokens to numbers.
Before the model is trained, the tokenizer is run over the training data and the tokens it produces are collected into a vocabulary list, often capped at a target size. Each token gets a unique number called an ID. When processing text, tokens are replaced by their IDs so models can work with numbers.
Result
Text is converted into sequences of token IDs, like [12, 45, 78].
Understanding vocabulary as a mapping to numbers is crucial because models only understand numbers, not text.
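A minimal sketch of this mapping, assuming a tiny two-sentence corpus that is already separated by spaces. IDs are assigned in order of first appearance, which is one simple convention among several.

```python
corpus = ["I love cats .", "I love dogs ."]

vocab: dict[str, int] = {}
for sentence in corpus:
    for token in sentence.split():
        if token not in vocab:
            vocab[token] = len(vocab)  # next free ID

print(vocab)  # {'I': 0, 'love': 1, 'cats': 2, '.': 3, 'dogs': 4}

# Encode a sentence by looking up each token's ID.
ids = [vocab[token] for token in "I love dogs .".split()]
print(ids)  # [0, 1, 4, 3]
```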
6
Advanced: Handling Unknown Tokens in Practice
🤔Before reading on: do you think models ignore unknown tokens or replace them with a special token? Commit to your answer.
Concept: Models use special tokens to represent unknown or rare tokens not in vocabulary.
When a token is not in the vocabulary, it is replaced by a special token such as <UNK> (written [UNK] in some tokenizers). This tells the model 'unknown word here' so it can still process the input without crashing.
Result
Text with unknown words is still processed, but with less precise understanding.
Knowing how unknown tokens are handled helps explain model errors and guides vocabulary design.
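The sketch below shows one common way to do this in plain Python. The token name <UNK>, its ID of 0, and the helper names encode and decode are assumptions for illustration. Decoding the result makes the loss of information visible.

```python
UNK = "<UNK>"  # assumed name for the unknown token
vocab = {UNK: 0, "I": 1, "love": 2, "cats": 3, ".": 4}
id_to_token = {token_id: token for token, token_id in vocab.items()}

def encode(tokens: list[str]) -> list[int]:
    # dict.get falls back to the unknown ID when a token is missing.
    return [vocab.get(token, vocab[UNK]) for token in tokens]

def decode(ids: list[int]) -> list[str]:
    return [id_to_token[token_id] for token_id in ids]

ids = encode(["I", "love", "axolotls", "."])
print(ids)          # [1, 2, 0, 4]
print(decode(ids))  # ['I', 'love', '<UNK>', '.']
```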
7
Expert: Vocabulary Size Trade-offs and Model Performance
🤔Before reading on: do you think bigger vocabularies always improve model accuracy? Commit to your answer.
Concept: Vocabulary size affects model size, speed, and ability to generalize, requiring careful balance.
A very large vocabulary means the model has many token embeddings to learn, increasing memory use and slowing training. A very small vocabulary means more words are broken into subwords or characters, which makes sequences longer and harder to learn. Experts tune vocabulary size to balance these trade-offs for best performance.
Result
Choosing vocabulary size impacts model efficiency and accuracy in real-world tasks.
Understanding vocabulary size trade-offs is key to optimizing models for production use.
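Both sides of the trade-off can be estimated with simple arithmetic. The sketch below assumes an embedding table with one vector per token and an embedding width of 768, which is an illustrative value rather than a recommendation.

```python
def embedding_parameters(vocab_size: int, embedding_dim: int) -> int:
    # One vector of length embedding_dim per vocabulary entry.
    return vocab_size * embedding_dim

embedding_dim = 768  # assumed width, for illustration only
for vocab_size in (1_000, 30_000, 250_000):
    print(vocab_size, embedding_parameters(vocab_size, embedding_dim))
# 1000 768000
# 30000 23040000
# 250000 192000000

# The other side: a smaller vocabulary yields longer sequences.
text = "tokenization matters"
print(len(text.split()), len(text))  # 2 word tokens vs 20 character tokens
```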
Under the Hood
Tokenization works by scanning text and applying rules or learned merges to split it into tokens. Vocabulary is stored as a lookup table mapping tokens to unique IDs. During model input, text is tokenized, tokens are converted to IDs, and these IDs are fed into embedding layers that turn them into vectors for the model to process.
Why designed this way?
Tokenization and vocabulary were designed to convert messy, variable-length text into fixed, manageable units that models can handle numerically. Early models used word-level vocabularies but faced huge size and unknown word problems. Subword tokenization and controlled vocabulary sizes were introduced to solve these issues and improve generalization.
Input Text
   │
   ▼
┌───────────────┐
│ Tokenizer     │
│ (rules/learned│
│  merges)      │
└───────────────┘
   │
   ▼
┌───────────────┐
│ Tokens        │
│ (strings)     │
└───────────────┘
   │
   ▼
┌───────────────┐
│ Vocabulary    │
│ (token → ID)  │
└───────────────┘
   │
   ▼
┌───────────────┐
│ Embedding     │
│ Layer         │
│ (ID → vector) │
└───────────────┘
   │
   ▼
Model Input Vectors
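The pipeline in the diagram can be sketched end to end in plain Python. Everything here is a toy assumption: the vocabulary, the whitespace tokenizer, and an embedding table filled with random numbers where a real model would have learned values.

```python
import random

random.seed(0)

vocab = {"<UNK>": 0, "I": 1, "love": 2, "cats": 3, ".": 4}
embedding_dim = 4

# One row per token ID; random values stand in for learned embeddings.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in vocab
]

def text_to_vectors(text: str):
    tokens = text.split()                                  # tokenizer
    ids = [vocab.get(t, vocab["<UNK>"]) for t in tokens]   # token -> ID
    vectors = [embedding_table[i] for i in ids]            # ID -> vector
    return tokens, ids, vectors

tokens, ids, vectors = text_to_vectors("I love cats .")
print(tokens)                         # ['I', 'love', 'cats', '.']
print(ids)                            # [1, 2, 3, 4]
print(len(vectors), len(vectors[0]))  # 4 4
```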
Myth Busters - 4 Common Misconceptions
Quick: Do you think tokenization always splits text only by spaces? Commit to yes or no.
Common Belief: Tokenization just splits text by spaces between words.
Reality: Tokenization can split text into subwords or characters, not just by spaces.
Why it matters: Assuming only space splitting limits understanding of how models handle rare or new words.
Quick: Do you think a bigger vocabulary always means better model understanding? Commit to yes or no.
Common Belief: The bigger the vocabulary, the better the model understands language.
Reality: Overly large vocabularies increase model size and slow training without always improving understanding.
Why it matters: Ignoring vocabulary size trade-offs can lead to inefficient models that are hard to deploy.
Quick: Do you think unknown words cause models to fail completely? Commit to yes or no.
Common Belief: If a word is not in the vocabulary, the model cannot process the text at all.
Reality: Models replace unknown words with a special token, or split them into known subword pieces, and continue processing, though with less precision.
Why it matters: Knowing this prevents panic when models encounter new words and helps design better vocabularies.
Quick: Do you think vocabulary is fixed and cannot change after training? Commit to yes or no.
Common Belief: Vocabulary is fixed once the model is trained and cannot be updated.
Reality: A vocabulary can be extended after training, although the new tokens need embeddings that are usually learned through further training. Character-level and byte-level tokenizers can also represent new words without any vocabulary change.
Why it matters: Understanding vocabulary flexibility helps in maintaining and improving models over time.
Expert Zone
1
Subword tokenization algorithms like BPE or WordPiece differ subtly in how they merge tokens, affecting model behavior.
2
Vocabulary choice impacts not only model size but also token embedding quality and downstream task performance.
3
Special tokens like padding, start/end, and unknown tokens play critical roles in model training and inference.
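As a rough illustration of the special tokens mentioned in the last point, the sketch below wraps sequences in start and end markers and pads a batch to equal length. The IDs chosen for PAD, BOS, and EOS and the helper names are hypothetical; real tokenizers define their own.

```python
PAD, BOS, EOS = 0, 1, 2  # hypothetical special-token IDs

def add_boundaries(ids: list[int]) -> list[int]:
    # Mark where the sequence starts and ends.
    return [BOS] + ids + [EOS]

def pad_batch(batch: list[list[int]]):
    max_len = max(len(seq) for seq in batch)
    padded = [seq + [PAD] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding the model should ignore.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return padded, mask

batch = [add_boundaries([5, 6, 7]), add_boundaries([8])]
padded, mask = pad_batch(batch)
print(padded)  # [[1, 5, 6, 7, 2], [1, 8, 2, 0, 0]]
print(mask)    # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```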
When NOT to use
Word-level tokenization with a fixed vocabulary is less suitable for languages with complex morphology or for tasks requiring open-vocabulary generation. Alternatives include character-level models or byte-level tokenization, whose small symbol sets can spell out any string and so avoid out-of-vocabulary words.
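A byte-level view of text can be sketched with Python's built-in UTF-8 encoding. This shows only the raw byte IDs, not a full byte-level tokenizer; the sample string is an arbitrary choice, written with escapes so the characters are unambiguous.

```python
text = "na\u00efve \u732b"  # 'naïve' followed by the Chinese character for cat

byte_ids = list(text.encode("utf-8"))

# Every ID fits in 0..255, so 256 symbols cover any string.
print(all(0 <= b < 256 for b in byte_ids))  # True

# The cost is length: non-ASCII characters take several bytes each.
print(len(text), len(byte_ids))  # 7 10

# The mapping is lossless: the bytes decode back to the original text.
print(bytes(byte_ids).decode("utf-8") == text)  # True
```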
Production Patterns
In production, tokenization and vocabulary are often optimized for speed and memory. Models use precompiled vocabularies and tokenizers, sometimes with caching. Vocabulary pruning or merging can be applied to reduce size with little loss of accuracy.
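Caching can be sketched with the standard library's functools.lru_cache. The tokenizer here is a toy regular-expression splitter and the cache size is an arbitrary assumption; the point is only that repeated inputs skip the tokenization work.

```python
import re
from functools import lru_cache

@lru_cache(maxsize=10_000)  # assumed cache size
def tokenize_cached(text: str) -> tuple[str, ...]:
    # Return a tuple so the cached result cannot be mutated by callers.
    return tuple(re.findall(r"\w+|[^\w\s]", text))

tokenize_cached("I love cats.")  # computed
tokenize_cached("I love cats.")  # served from the cache

info = tokenize_cached.cache_info()
print(info.hits, info.misses)  # 1 1
```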
Connections
Data Compression
Tokenization algorithms like BPE are inspired by data compression techniques that find common patterns to reduce size.
Understanding compression helps grasp why subword tokenization merges frequent parts to build efficient vocabularies.
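The merge idea can be sketched as a simplified version of BPE training: count adjacent symbol pairs, merge the most frequent pair, and repeat. The toy word counts and the function name learn_bpe_merges are assumptions, and details such as end-of-word markers are left out.

```python
from collections import Counter

def learn_bpe_merges(word_counts: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    # Start with each word as a tuple of single characters.
    words = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent pair
        merges.append(best)
        merged_words = {}
        for symbols, count in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # fuse the pair
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged_words[key] = merged_words.get(key, 0) + count
        words = merged_words
    return merges

merges = learn_bpe_merges({"low": 5, "lower": 2, "lowest": 3}, num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent stem 'low' has become a single symbol, much as a compressor replaces a repeated pattern with a shorter code.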
Human Language Learning
Vocabulary acquisition in AI models parallels how humans learn words and break down unfamiliar words into known parts.
Knowing how humans learn language helps design tokenization methods that mimic natural understanding.
Digital Signal Processing
Tokenization converts continuous text into discrete units, similar to how signals are sampled into discrete values.
This connection shows how continuous information is discretized for machine processing across fields.
Common Pitfalls
#1 Using a vocabulary that is too small, causing excessive token splitting and long sequences.
Wrong approach: Tokenizer with vocabulary size 1000 applied to complex text, resulting in many tiny tokens.
Correct approach: Choose a balanced vocabulary size (e.g., 30,000) to reduce token fragmentation while controlling model size.
Root cause: Misunderstanding the trade-off between vocabulary size and token sequence length.
#2 Ignoring unknown tokens and letting them crash the model during inference.
Wrong approach: Feeding text with unknown words directly without replacing or handling them.
Correct approach: Replace unknown tokens with a special token before model input.
Root cause: Not accounting for vocabulary limits and special token handling.
#3 Assuming tokenization is language-agnostic and using the same tokenizer for all languages.
Wrong approach: Applying an English word-level tokenizer to Chinese text without adjustment.
Correct approach: Use language-specific tokenizers or subword methods that handle the language's script, word boundaries, and morphology properly.
Root cause: Overlooking language differences in text structure and tokenization needs.
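The Chinese case can be shown in a few lines. Written Chinese does not put spaces between words, so a whitespace splitter returns the whole sentence as one token. The sample sentence ("I love cats") is an arbitrary choice, and character splitting is shown only as a crude fallback, not as proper word segmentation.

```python
text = "我爱猫"  # "I love cats", written without spaces

word_level = text.split()  # whitespace splitting finds nothing to split on
char_level = list(text)    # one token per character

print(word_level)  # ['我爱猫']
print(char_level)  # ['我', '爱', '猫']
```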
Key Takeaways
Tokenization breaks text into manageable pieces called tokens, which are the building blocks for language models.
Vocabulary is the set of tokens a model knows, mapping each token to a unique ID for processing.
Different tokenization methods like word, subword, and character-level affect vocabulary size and model flexibility.
Handling unknown tokens with special tokens allows models to process new or rare words without failure.
Choosing the right vocabulary size balances model performance, size, and ability to generalize to new language.

Practice

1. What does tokenization do in natural language processing?
easy
A. Converts tokens into images
B. Breaks text into smaller pieces called tokens
C. Removes all punctuation from text
D. Combines multiple texts into one

Solution

  1. Step 1: Understand the role of tokenization

    Tokenization splits text into smaller parts called tokens, like words or subwords.
  2. Step 2: Compare options with tokenization definition

    Only option B, "Breaks text into smaller pieces called tokens", matches this definition.
  3. Final Answer:

    Breaks text into smaller pieces called tokens -> Option B
  4. Quick Check:

    Tokenization = splitting text [OK]
Hint: Tokenization means splitting text into pieces [OK]
Common Mistakes:
  • Thinking tokenization changes text to images
  • Confusing tokenization with removing punctuation
  • Believing tokenization merges texts
2. Which of the following is the correct way to represent a token ID in Python?
easy
A. token_id = 'word'
B. token_id = {word: 1}
C. token_id = [word]
D. token_id = 123

Solution

  1. Step 1: Understand token ID representation

    Token IDs are numbers representing tokens, so they should be integers.
  2. Step 2: Check each option's type

    token_id = 123 assigns an integer 123, which is correct. Others use strings, lists, or dictionaries incorrectly.
  3. Final Answer:

    token_id = 123 -> Option D
  4. Quick Check:

    Token ID = number [OK]
Hint: Token IDs are numbers, not words or lists [OK]
Common Mistakes:
  • Using strings instead of numbers for token IDs
  • Confusing token IDs with token text
  • Using lists or dictionaries wrongly
3. Given the vocabulary {'hello': 1, 'world': 2, '!': 3}, what is the token ID list for the text 'hello world!'?
medium
A. [1, 2, 3]
B. [0, 1, 2]
C. ['hello', 'world', '!']
D. [3, 2, 1]

Solution

  1. Step 1: Map each word to its token ID

    'hello' maps to 1, 'world' maps to 2, and '!' maps to 3 according to the vocabulary.
  2. Step 2: Create the token ID list in order

    The text 'hello world!' becomes [1, 2, 3].
  3. Final Answer:

    [1, 2, 3] -> Option A
  4. Quick Check:

    Text tokens = [1, 2, 3] [OK]
Hint: Match words to IDs in order [OK]
Common Mistakes:
  • Mixing up token order
  • Using token text instead of IDs
  • Assigning wrong IDs from vocabulary
4. What is wrong with this tokenization code snippet?
vocab = {'hi': 1, 'there': 2}
text = 'hi there'
tokens = [vocab[word] for word in text.split() if word in vocab]
medium
A. It will raise a KeyError if a word is missing
B. It correctly tokenizes the text
C. It ignores words not in vocabulary
D. It uses split() incorrectly on the text

Solution

  1. Step 1: Analyze the list comprehension

    The code splits text and includes only words found in vocab, skipping others.
  2. Step 2: Identify behavior on unknown words

    Words not in vocab are ignored, which may lose information.
  3. Final Answer:

    It ignores words not in vocabulary -> Option C
  4. Quick Check:

    Unknown words skipped = ignoring [OK]
Hint: Check if unknown words are skipped or cause errors [OK]
Common Mistakes:
  • Assuming KeyError will happen due to 'if' check
  • Thinking split() is wrong here
  • Missing that unknown words are ignored silently
5. You have a vocabulary with tokens: {'I':1, 'love':2, 'AI':3, '.':4}. How would you tokenize the sentence 'I love AI!' considering the exclamation mark is not in the vocabulary?
hard
A. Add '!' to vocabulary with new ID and tokenize as [1, 2, 3, 5]
B. Replace '!' with '.' and tokenize as [1, 2, 3, 4]
C. Ignore '!' and tokenize as [1, 2, 3]
D. Raise an error because '!' is unknown

Solution

  1. Step 1: Understand vocabulary coverage

    The vocabulary lacks '!' and defines no unknown token, so of the options given, only adding '!' represents the whole sentence without losing or altering it.
  2. Step 2: Add '!' with a new token ID

    Assign '!' a new ID (e.g., 5) and tokenize the sentence as [1, 2, 3, 5].
  3. Final Answer:

    Add '!' to vocabulary with new ID and tokenize as [1, 2, 3, 5] -> Option A
  4. Quick Check:

    Unknown token added = new ID [OK]
Hint: Add unknown tokens to vocabulary before tokenizing [OK]
Common Mistakes:
  • Ignoring unknown tokens silently
  • Replacing unknown tokens incorrectly
  • Assuming error without handling unknown tokens