Prompt Engineering / GenAI · ~15 mins

Tokenization and vocabulary in Prompt Engineering / GenAI - Deep Dive

Overview - Tokenization and vocabulary
What is it?
Tokenization is the process of breaking text into smaller pieces called tokens, which can be words, parts of words, or characters. Vocabulary is the collection of all unique tokens that a model knows and uses to understand and generate language. Together, tokenization and vocabulary help machines read and work with human language by turning sentences into manageable pieces.
Why it matters
Without tokenization and vocabulary, machines would see text as long, confusing strings of letters with no clear meaning. This would make it impossible for AI to understand, learn from, or generate language effectively. Tokenization and vocabulary let AI models handle language in a structured way, enabling everything from chatbots to translation tools to work well.
Where it fits
Before learning tokenization and vocabulary, you should understand basic text data and how computers represent information. After this, you can learn about embedding tokens into numbers and how models use these embeddings to learn language patterns.
Mental Model
Core Idea
Tokenization breaks text into known pieces, and vocabulary is the set of those pieces that a model understands and uses to read and write language.
Think of it like...
Tokenization and vocabulary are like cutting a long sentence into puzzle pieces and having a box of known pieces to build or understand the picture. If a piece is missing from the box, the puzzle can't be completed properly.
Text input
  │
  ▼
┌─────────────┐
│ Tokenizer   │  -- splits text into tokens
└─────────────┘
      │
      ▼
┌─────────────┐
│ Tokens      │  -- pieces like words or subwords
└─────────────┘
      │
      ▼
┌─────────────┐
│ Vocabulary  │  -- known tokens the model uses
└─────────────┘
Build-Up - 7 Steps
1
Foundation: What is Tokenization in Text
Concept: Tokenization means splitting text into smaller parts called tokens.
Imagine a sentence: "I love cats." Tokenization breaks it into tokens like ['I', 'love', 'cats', '.']. These tokens are easier for a computer to handle than the whole sentence at once.
Result
The sentence is now a list of tokens: ['I', 'love', 'cats', '.']
Understanding tokenization is key because it turns raw text into pieces that machines can process step-by-step.
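The splitting described above can be sketched in a few lines of Python. This is a toy word-level tokenizer (the regex-based split is an illustrative choice, not how production tokenizers work):

```python
import re

def simple_tokenize(text):
    """Toy word-level tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("I love cats.")
print(tokens)  # ['I', 'love', 'cats', '.']
```

Note that punctuation becomes its own token, matching the example above.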
2
Foundation: Understanding Vocabulary in Models
Concept: Vocabulary is the set of all tokens a model knows and uses.
If a model's vocabulary includes ['I', 'love', 'cats', '.'], it can understand and generate these tokens. Tokens not in the vocabulary are unknown and cause problems.
Result
The model can recognize and work with tokens only if they are in its vocabulary.
Knowing vocabulary limits helps explain why some words or symbols confuse AI models.
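A vocabulary check is just set membership. A minimal sketch, using the toy four-token vocabulary from the example above:

```python
# Toy vocabulary: the set of tokens this model "knows".
vocab = {'I', 'love', 'cats', '.'}

def in_vocab(token):
    """True if the model can represent this token directly."""
    return token in vocab

print(in_vocab('cats'))  # True
print(in_vocab('dogs'))  # False: 'dogs' would be an unknown token
```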
3
Intermediate: Different Tokenization Methods
🤔 Before reading on: do you think tokenization always splits text by spaces, or can it be more complex? Commit to your answer.
Concept: Tokenization can split text by words, subwords, or characters depending on the method.
Word-level tokenization splits by spaces, e.g., 'cats' is one token. Subword tokenization breaks words into smaller parts like 'cat' + 's'. Character-level tokenization splits every letter as a token.
Result
Different tokenization methods produce different token lists for the same text.
Understanding tokenization types helps choose the right method for balancing vocabulary size and model flexibility.
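Two of the three methods are easy to sketch directly; subword tokenization needs a learned merge table, so it is only hinted at in the comment:

```python
def word_level(text):
    """Word-level: split on whitespace."""
    return text.split()

def char_level(text):
    """Character-level: every non-space character is its own token."""
    return [ch for ch in text if not ch.isspace()]

sentence = "I love cats"
print(word_level(sentence))  # ['I', 'love', 'cats']
print(char_level(sentence))  # ['I', 'l', 'o', 'v', 'e', 'c', 'a', 't', 's']
# A subword tokenizer might instead produce something like ['I', 'love', 'cat', 's'].
```

The same sentence yields 3 tokens at word level but 9 at character level, which previews the vocabulary-size vs. sequence-length trade-off discussed later.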
4
Intermediate: Why Subword Tokenization is Popular
🤔 Before reading on: do you think using whole words or subwords leads to smaller vocabulary sizes? Commit to your answer.
Concept: Subword tokenization reduces vocabulary size and handles unknown words better.
Subword methods like Byte Pair Encoding (BPE) break rare words into common parts. For example, 'unhappiness' might be split into 'un', 'happi', 'ness'. This lets models understand new words by combining known pieces.
Result
Models can handle new or rare words without needing huge vocabularies.
Knowing why subword tokenization works explains how models generalize to words they never saw during training.
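The core of BPE is a loop that repeatedly finds the most frequent adjacent pair of symbols and merges it. A minimal sketch on a hypothetical toy corpus (real BPE implementations also handle word boundaries and byte fallback):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word split into characters, with its frequency.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
for _ in range(3):  # learn 3 merges
    words = merge_pair(words, most_frequent_pair(words))
print(words)  # 'low' becomes one token; 'lower'/'lowest' share the piece 'lowe'
```

After three merges, the frequent fragment "low" has become a single token, and the rarer words are built from shared pieces, which is exactly how models assemble words they rarely saw.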
5
Intermediate: Building and Using a Vocabulary
Concept: Vocabulary is built from training data tokens and used to map tokens to numbers.
During training, all tokens found are collected into a vocabulary list. Each token gets a unique number called an ID. When processing text, tokens are replaced by their IDs so models can work with numbers.
Result
Text is converted into sequences of token IDs, like [12, 45, 78].
Understanding vocabulary as a mapping to numbers is crucial because models only understand numbers, not text.
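The token-to-ID mapping described above can be sketched directly (a real system would also sort tokens by frequency and cap the vocabulary size):

```python
def build_vocab(corpus_tokens):
    """Assign each unique token a stable integer ID, in order of first appearance."""
    vocab = {}
    for token in corpus_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Replace each token with its ID so the model can work with numbers."""
    return [vocab[t] for t in tokens]

corpus = ['I', 'love', 'cats', '.', 'I', 'love', 'dogs', '.']
vocab = build_vocab(corpus)
print(vocab)                                       # {'I': 0, 'love': 1, 'cats': 2, '.': 3, 'dogs': 4}
print(encode(['I', 'love', 'cats', '.'], vocab))   # [0, 1, 2, 3]
```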
6
Advanced: Handling Unknown Tokens in Practice
🤔 Before reading on: do you think models ignore unknown tokens or replace them with a special token? Commit to your answer.
Concept: Models use special tokens to represent unknown or rare tokens not in vocabulary.
When a token is not in the vocabulary, it is replaced by a special token such as <UNK>. This tells the model 'unknown word here' so it can still process the input without crashing.
Result
Text with unknown words is still processed, but with less precise understanding.
Knowing how unknown tokens are handled helps explain model errors and guides vocabulary design.
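The fallback is one `dict.get` with a default. A sketch, assuming the <UNK> token is reserved at ID 0 (the word 'axolotls' is just a stand-in for any out-of-vocabulary token):

```python
UNK_ID = 0

def encode_with_unk(tokens, vocab):
    """Map tokens to IDs, falling back to the <UNK> ID for out-of-vocabulary tokens."""
    return [vocab.get(t, UNK_ID) for t in tokens]

vocab = {'<UNK>': 0, 'I': 1, 'love': 2, 'cats': 3, '.': 4}
print(encode_with_unk(['I', 'love', 'axolotls', '.'], vocab))  # [1, 2, 0, 4]
```

The input is processed end to end, but the model only knows that *some* unknown word appeared in position three, which is the "less precise understanding" noted above.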
7
Expert: Vocabulary Size Trade-offs and Model Performance
🤔 Before reading on: do you think bigger vocabularies always improve model accuracy? Commit to your answer.
Concept: Vocabulary size affects model size, speed, and ability to generalize, requiring careful balance.
A very large vocabulary means the model has many tokens to learn, increasing memory and slowing training. A very small vocabulary means more tokens are broken into subwords or characters, which can make sequences longer and harder to learn. Experts tune vocabulary size to balance these trade-offs for best performance.
Result
Choosing vocabulary size impacts model efficiency and accuracy in real-world tasks.
Understanding vocabulary size trade-offs is key to optimizing models for production use.
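The trade-off shows up concretely as sequence length. The subword split below is hypothetical, chosen only to illustrate the middle ground:

```python
text = "tokenization"

word_seq = [text]                   # word-level: shortest sequence, but every new word needs its own vocabulary entry
subword_seq = ["token", "ization"]  # subword (hypothetical split): a middle ground
char_seq = list(text)               # character-level: tiny vocabulary, but much longer sequences

print(len(word_seq), len(subword_seq), len(char_seq))  # 1 2 12
```

Longer sequences cost more compute per input and stretch the distances over which the model must learn dependencies, which is why vocabulary size is tuned rather than simply minimized.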
Under the Hood
Tokenization works by scanning text and applying rules or learned merges to split it into tokens. Vocabulary is stored as a lookup table mapping tokens to unique IDs. During model input, text is tokenized, tokens are converted to IDs, and these IDs are fed into embedding layers that turn them into vectors for the model to process.
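The whole pipeline above (text → tokens → IDs → vectors) fits in a short sketch. The vocabulary, the crude space/period tokenizer, and the random 4-dimensional embedding table are all toy assumptions; real models use learned embeddings with hundreds of dimensions:

```python
import random

random.seed(0)  # deterministic toy embeddings

# Toy vocabulary and embedding table (one small random vector per token ID).
vocab = {'<UNK>': 0, 'I': 1, 'love': 2, 'cats': 3, '.': 4}
embedding = [[random.random() for _ in range(4)] for _ in vocab]

def text_to_vectors(text):
    """Full toy pipeline: text -> tokens -> IDs -> embedding vectors."""
    tokens = text.replace('.', ' .').split()           # crude tokenizer
    ids = [vocab.get(t, vocab['<UNK>']) for t in tokens]
    return [embedding[i] for i in ids]                 # embedding lookup

vectors = text_to_vectors("I love cats.")
print(len(vectors), len(vectors[0]))  # 4 vectors, each of dimension 4
```

An out-of-vocabulary word simply lands on the <UNK> row of the embedding table, so the model still receives a vector for every position.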
Why designed this way?
Tokenization and vocabulary were designed to convert messy, variable-length text into fixed, manageable units that models can handle numerically. Early models used word-level vocabularies but faced huge size and unknown word problems. Subword tokenization and controlled vocabulary sizes were introduced to solve these issues and improve generalization.
Input Text
   │
   ▼
┌───────────────┐
│ Tokenizer     │
│ (rules/learned│
│  merges)      │
└───────────────┘
   │
   ▼
┌───────────────┐
│ Tokens        │
│ (strings)     │
└───────────────┘
   │
   ▼
┌───────────────┐
│ Vocabulary    │
│ (token → ID)  │
└───────────────┘
   │
   ▼
┌───────────────┐
│ Embedding     │
│ Layer         │
│ (ID → vector) │
└───────────────┘
   │
   ▼
Model Input Vectors
Myth Busters - 4 Common Misconceptions
Quick: Do you think tokenization always splits text only by spaces? Commit to yes or no.
Common Belief: Tokenization just splits text by spaces between words.
Reality: Tokenization can split text into subwords or characters, not just by spaces.
Why it matters: Assuming only space splitting limits understanding of how models handle rare or new words.
Quick: Do you think a bigger vocabulary always means better model understanding? Commit to yes or no.
Common Belief: The bigger the vocabulary, the better the model understands language.
Reality: Overly large vocabularies increase model size and slow training without necessarily improving understanding.
Why it matters: Ignoring vocabulary size trade-offs can lead to inefficient models that are hard to deploy.
Quick: Do you think unknown words cause models to fail completely? Commit to yes or no.
Common Belief: If a word is not in the vocabulary, the model cannot process the text at all.
Reality: Models replace unknown words with special tokens to continue processing, though with less precision.
Why it matters: Knowing this prevents panic when models encounter new words and helps design better vocabularies.
Quick: Do you think vocabulary is fixed and cannot change after training? Commit to yes or no.
Common Belief: Vocabulary is fixed once the model is trained and cannot be updated.
Reality: Some models support vocabulary updates or use dynamic tokenization methods to adapt to new words.
Why it matters: Understanding vocabulary flexibility helps in maintaining and improving models over time.
Expert Zone
1
Subword tokenization algorithms like BPE or WordPiece differ subtly in how they merge tokens, affecting model behavior.
2
Vocabulary choice impacts not only model size but also token embedding quality and downstream task performance.
3
Special tokens like padding, start/end, and unknown tokens play critical roles in model training and inference.
When NOT to use
Tokenization and fixed vocabulary approaches are less suitable for languages with complex morphology or for tasks requiring open vocabulary generation. Alternatives include character-level models or byte-level tokenization that avoid fixed vocabularies.
Production Patterns
In production, tokenization and vocabulary are often optimized for speed and memory. Models use precompiled vocabularies and tokenizers, sometimes with caching. Vocabulary pruning or merging is applied to reduce size without losing accuracy.
Connections
Data Compression
Tokenization algorithms like BPE are inspired by data compression techniques that find common patterns to reduce size.
Understanding compression helps grasp why subword tokenization merges frequent parts to build efficient vocabularies.
Human Language Learning
Vocabulary acquisition in AI models parallels how humans learn words and break down unfamiliar words into known parts.
Knowing how humans learn language helps design tokenization methods that mimic natural understanding.
Digital Signal Processing
Tokenization converts continuous text into discrete units, similar to how signals are sampled into discrete values.
This connection shows how continuous information is discretized for machine processing across fields.
Common Pitfalls
#1 Using a vocabulary that is too small, causing excessive token splitting and long sequences.
Wrong approach: Tokenizer with vocabulary size 1,000 applied to complex text, resulting in many tiny tokens.
Correct approach: Choose a balanced vocabulary size (e.g., 30,000) to reduce token fragmentation while controlling model size.
Root cause: Misunderstanding the trade-off between vocabulary size and token sequence length.
#2 Ignoring unknown tokens and letting them crash the model during inference.
Wrong approach: Feeding text with unknown words directly without replacing or handling them.
Correct approach: Replace unknown tokens with a special token before model input.
Root cause: Not accounting for vocabulary limits and special token handling.
#3 Assuming tokenization is language-agnostic and using the same tokenizer for all languages.
Wrong approach: Applying an English word-level tokenizer to Chinese text without adjustment.
Correct approach: Use language-specific tokenizers or subword methods that handle language morphology properly.
Root cause: Overlooking language differences in text structure and tokenization needs.
Key Takeaways
Tokenization breaks text into manageable pieces called tokens, which are the building blocks for language models.
Vocabulary is the set of tokens a model knows, mapping each token to a unique ID for processing.
Different tokenization methods like word, subword, and character-level affect vocabulary size and model flexibility.
Handling unknown tokens with special tokens allows models to process new or rare words without failure.
Choosing the right vocabulary size balances model performance, size, and ability to generalize to new language.