Prompt Engineering / GenAI · ~15 mins

Tokenization and vocabulary in Prompt Engineering / GenAI - Deep Dive

Overview - Tokenization and vocabulary
What is it?
Tokenization is the process of breaking text into smaller pieces called tokens, which can be words, parts of words, or characters. Vocabulary is the collection of all unique tokens that a model knows and uses to understand and generate language. Together, tokenization and vocabulary help machines read and work with human language by turning sentences into manageable pieces.
Why it matters
Without tokenization and vocabulary, machines would see text as long, confusing strings of letters with no clear meaning. This would make it impossible for AI to understand, learn from, or generate language effectively. Tokenization and vocabulary let AI models handle language in a structured way, enabling everything from chatbots to translation tools to work well.
Where it fits
Before learning tokenization and vocabulary, you should understand basic text data and how computers represent information. After this, you can learn about embedding tokens into numbers and how models use these embeddings to learn language patterns.
Mental Model
Core Idea
Tokenization breaks text into known pieces, and vocabulary is the set of those pieces that a model understands and uses to read and write language.
Think of it like...
Tokenization and vocabulary are like cutting a long sentence into puzzle pieces and having a box of known pieces to build or understand the picture. If a piece is missing from the box, the puzzle can't be completed properly.
Text input
  │
  ▼
┌─────────────┐
│ Tokenizer   │  -- splits text into tokens
└─────────────┘
      │
      ▼
┌─────────────┐
│ Tokens      │  -- pieces like words or subwords
└─────────────┘
      │
      ▼
┌─────────────┐
│ Vocabulary  │  -- known tokens the model uses
└─────────────┘
Build-Up - 7 Steps
1
Foundation: What is Tokenization in Text
Concept: Tokenization means splitting text into smaller parts called tokens.
Imagine a sentence: "I love cats." Tokenization breaks it into tokens like ['I', 'love', 'cats', '.']. These tokens are easier for a computer to handle than the whole sentence at once.
Result
The sentence is now a list of tokens: ['I', 'love', 'cats', '.']
Understanding tokenization is key because it turns raw text into pieces that machines can process step-by-step.
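The splitting described above can be sketched in a few lines of Python. This is a toy word-level tokenizer (the regex-based split is an illustrative choice, not how production tokenizers work):

```python
import re

def simple_tokenize(text):
    """Toy word-level tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("I love cats.")
print(tokens)  # ['I', 'love', 'cats', '.']
```

Note that punctuation becomes its own token, matching the example above.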
2
Foundation: Understanding Vocabulary in Models
Concept: Vocabulary is the set of all tokens a model knows and uses.
If a model's vocabulary includes ['I', 'love', 'cats', '.'], it can understand and generate these tokens. Tokens not in the vocabulary are unknown and cause problems.
Result
The model can recognize and work with tokens only if they are in its vocabulary.
Knowing vocabulary limits helps explain why some words or symbols confuse AI models.
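A vocabulary check is just set membership. A minimal sketch, using the toy four-token vocabulary from the example above:

```python
# Toy vocabulary: the set of tokens this model "knows".
vocab = {'I', 'love', 'cats', '.'}

def in_vocab(token):
    """True if the model can represent this token directly."""
    return token in vocab

print(in_vocab('cats'))  # True
print(in_vocab('dogs'))  # False: 'dogs' would be an unknown token
```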
3
Intermediate: Different Tokenization Methods
🤔 Before reading on: do you think tokenization always splits text by spaces, or can it be more complex? Commit to your answer.
Concept: Tokenization can split text by words, subwords, or characters depending on the method.
Word-level tokenization splits by spaces, e.g., 'cats' is one token. Subword tokenization breaks words into smaller parts like 'cat' + 's'. Character-level tokenization splits every letter as a token.
Result
Different tokenization methods produce different token lists for the same text.
Understanding tokenization types helps choose the right method for balancing vocabulary size and model flexibility.
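Two of the three methods are easy to sketch directly; subword tokenization needs a learned merge table, so it is only hinted at in the comment:

```python
def word_level(text):
    """Word-level: split on whitespace."""
    return text.split()

def char_level(text):
    """Character-level: every non-space character is its own token."""
    return [ch for ch in text if not ch.isspace()]

sentence = "I love cats"
print(word_level(sentence))  # ['I', 'love', 'cats']
print(char_level(sentence))  # ['I', 'l', 'o', 'v', 'e', 'c', 'a', 't', 's']
# A subword tokenizer might instead produce something like ['I', 'love', 'cat', 's'].
```

The same sentence yields 3 tokens at word level but 9 at character level, which previews the vocabulary-size vs. sequence-length trade-off discussed later.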
4
Intermediate: Why Subword Tokenization is Popular
🤔 Before reading on: do you think using whole words or subwords leads to smaller vocabulary sizes? Commit to your answer.
Concept: Subword tokenization reduces vocabulary size and handles unknown words better.
Subword methods like Byte Pair Encoding (BPE) break rare words into common parts. For example, 'unhappiness' might be split into 'un', 'happi', 'ness'. This lets models understand new words by combining known pieces.
Result
Models can handle new or rare words without needing huge vocabularies.
Knowing why subword tokenization works explains how models generalize to words they never saw during training.
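The core of BPE is a loop that repeatedly finds the most frequent adjacent pair of symbols and merges it. A minimal sketch on a hypothetical toy corpus (real BPE implementations also handle word boundaries and byte fallback):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word split into characters, with its frequency.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
for _ in range(3):  # learn 3 merges
    words = merge_pair(words, most_frequent_pair(words))
print(words)  # 'low' becomes one token; 'lower'/'lowest' share the piece 'lowe'
```

After three merges, the frequent fragment "low" has become a single token, and the rarer words are built from shared pieces, which is exactly how models assemble words they rarely saw.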
5
Intermediate: Building and Using a Vocabulary
Concept: Vocabulary is built from training data tokens and used to map tokens to numbers.
During training, all tokens found are collected into a vocabulary list. Each token gets a unique number called an ID. When processing text, tokens are replaced by their IDs so models can work with numbers.
Result
Text is converted into sequences of token IDs, like [12, 45, 78].
Understanding vocabulary as a mapping to numbers is crucial because models only understand numbers, not text.
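The token-to-ID mapping described above can be sketched directly (a real system would also sort tokens by frequency and cap the vocabulary size):

```python
def build_vocab(corpus_tokens):
    """Assign each unique token a stable integer ID, in order of first appearance."""
    vocab = {}
    for token in corpus_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Replace each token with its ID so the model can work with numbers."""
    return [vocab[t] for t in tokens]

corpus = ['I', 'love', 'cats', '.', 'I', 'love', 'dogs', '.']
vocab = build_vocab(corpus)
print(vocab)                                       # {'I': 0, 'love': 1, 'cats': 2, '.': 3, 'dogs': 4}
print(encode(['I', 'love', 'cats', '.'], vocab))   # [0, 1, 2, 3]
```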
6
Advanced: Handling Unknown Tokens in Practice
🤔 Before reading on: do you think models ignore unknown tokens or replace them with a special token? Commit to your answer.
Concept: Models use special tokens to represent unknown or rare tokens not in vocabulary.
When a token is not in the vocabulary, it is replaced by a special token such as <UNK>. This tells the model 'unknown word here' so it can still process the input without crashing.
Result
Text with unknown words is still processed, but with less precise understanding.
Knowing how unknown tokens are handled helps explain model errors and guides vocabulary design.
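The fallback is one `dict.get` with a default. A sketch, assuming the <UNK> token is reserved at ID 0 (the word 'axolotls' is just a stand-in for any out-of-vocabulary token):

```python
UNK_ID = 0

def encode_with_unk(tokens, vocab):
    """Map tokens to IDs, falling back to the <UNK> ID for out-of-vocabulary tokens."""
    return [vocab.get(t, UNK_ID) for t in tokens]

vocab = {'<UNK>': 0, 'I': 1, 'love': 2, 'cats': 3, '.': 4}
print(encode_with_unk(['I', 'love', 'axolotls', '.'], vocab))  # [1, 2, 0, 4]
```

The input is processed end to end, but the model only knows that *some* unknown word appeared in position three, which is the "less precise understanding" noted above.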
7
Expert: Vocabulary Size Trade-offs and Model Performance
🤔 Before reading on: do you think bigger vocabularies always improve model accuracy? Commit to your answer.
Concept: Vocabulary size affects model size, speed, and ability to generalize, requiring careful balance.
A very large vocabulary means the model has many tokens to learn, increasing memory and slowing training. A very small vocabulary means more tokens are broken into subwords or characters, which can make sequences longer and harder to learn. Experts tune vocabulary size to balance these trade-offs for best performance.
Result
Choosing vocabulary size impacts model efficiency and accuracy in real-world tasks.
Understanding vocabulary size trade-offs is key to optimizing models for production use.
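The trade-off shows up concretely as sequence length. The subword split below is hypothetical, chosen only to illustrate the middle ground:

```python
text = "tokenization"

word_seq = [text]                   # word-level: shortest sequence, but every new word needs its own vocabulary entry
subword_seq = ["token", "ization"]  # subword (hypothetical split): a middle ground
char_seq = list(text)               # character-level: tiny vocabulary, but much longer sequences

print(len(word_seq), len(subword_seq), len(char_seq))  # 1 2 12
```

Longer sequences cost more compute per input and stretch the distances over which the model must learn dependencies, which is why vocabulary size is tuned rather than simply minimized.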
Under the Hood
Tokenization works by scanning text and applying rules or learned merges to split it into tokens. Vocabulary is stored as a lookup table mapping tokens to unique IDs. During model input, text is tokenized, tokens are converted to IDs, and these IDs are fed into embedding layers that turn them into vectors for the model to process.
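The whole pipeline above (text → tokens → IDs → vectors) fits in a short sketch. The vocabulary, the crude space/period tokenizer, and the random 4-dimensional embedding table are all toy assumptions; real models use learned embeddings with hundreds of dimensions:

```python
import random

random.seed(0)  # deterministic toy embeddings

# Toy vocabulary and embedding table (one small random vector per token ID).
vocab = {'<UNK>': 0, 'I': 1, 'love': 2, 'cats': 3, '.': 4}
embedding = [[random.random() for _ in range(4)] for _ in vocab]

def text_to_vectors(text):
    """Full toy pipeline: text -> tokens -> IDs -> embedding vectors."""
    tokens = text.replace('.', ' .').split()           # crude tokenizer
    ids = [vocab.get(t, vocab['<UNK>']) for t in tokens]
    return [embedding[i] for i in ids]                 # embedding lookup

vectors = text_to_vectors("I love cats.")
print(len(vectors), len(vectors[0]))  # 4 vectors, each of dimension 4
```

An out-of-vocabulary word simply lands on the <UNK> row of the embedding table, so the model still receives a vector for every position.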
Why designed this way?
Tokenization and vocabulary were designed to convert messy, variable-length text into fixed, manageable units that models can handle numerically. Early models used word-level vocabularies but faced huge size and unknown word problems. Subword tokenization and controlled vocabulary sizes were introduced to solve these issues and improve generalization.
Input Text
   │
   ▼
┌───────────────┐
│ Tokenizer     │
│ (rules/learned│
│  merges)      │
└───────────────┘
   │
   ▼
┌───────────────┐
│ Tokens        │
│ (strings)     │
└───────────────┘
   │
   ▼
┌───────────────┐
│ Vocabulary    │
│ (token → ID)  │
└───────────────┘
   │
   ▼
┌───────────────┐
│ Embedding     │
│ Layer         │
│ (ID → vector) │
└───────────────┘
   │
   ▼
Model Input Vectors
Myth Busters - 4 Common Misconceptions
Quick: Do you think tokenization always splits text only by spaces? Commit to yes or no.
Common Belief: Tokenization just splits text by spaces between words.
Reality: Tokenization can split text into subwords or characters, not just by spaces.
Why it matters: Assuming only space splitting limits understanding of how models handle rare or new words.
Quick: Do you think a bigger vocabulary always means better model understanding? Commit to yes or no.
Common Belief: The bigger the vocabulary, the better the model understands language.
Reality: Overly large vocabularies increase model size and slow training without necessarily improving understanding.
Why it matters: Ignoring vocabulary size trade-offs can lead to inefficient models that are hard to deploy.
Quick: Do you think unknown words cause models to fail completely? Commit to yes or no.
Common Belief: If a word is not in the vocabulary, the model cannot process the text at all.
Reality: Models replace unknown words with special tokens to continue processing, though with less precision.
Why it matters: Knowing this prevents panic when models encounter new words and helps design better vocabularies.
Quick: Do you think vocabulary is fixed and cannot change after training? Commit to yes or no.
Common Belief: Vocabulary is fixed once the model is trained and cannot be updated.
Reality: Some models support vocabulary updates or use dynamic tokenization methods to adapt to new words.
Why it matters: Understanding vocabulary flexibility helps in maintaining and improving models over time.
Expert Zone
1
Subword tokenization algorithms like BPE or WordPiece differ subtly in how they merge tokens, affecting model behavior.
2
Vocabulary choice impacts not only model size but also token embedding quality and downstream task performance.
3
Special tokens like padding, start/end, and unknown tokens play critical roles in model training and inference.
When NOT to use
Tokenization and fixed vocabulary approaches are less suitable for languages with complex morphology or for tasks requiring open vocabulary generation. Alternatives include character-level models or byte-level tokenization that avoid fixed vocabularies.
Production Patterns
In production, tokenization and vocabulary are often optimized for speed and memory. Models use precompiled vocabularies and tokenizers, sometimes with caching. Vocabulary pruning or merging is applied to reduce size without losing accuracy.
Connections
Data Compression
Tokenization algorithms like BPE are inspired by data compression techniques that find common patterns to reduce size.
Understanding compression helps grasp why subword tokenization merges frequent parts to build efficient vocabularies.
Human Language Learning
Vocabulary acquisition in AI models parallels how humans learn words and break down unfamiliar words into known parts.
Knowing how humans learn language helps design tokenization methods that mimic natural understanding.
Digital Signal Processing
Tokenization converts continuous text into discrete units, similar to how signals are sampled into discrete values.
This connection shows how continuous information is discretized for machine processing across fields.
Common Pitfalls
#1 Using a vocabulary that is too small, causing excessive token splitting and long sequences.
Wrong approach: Tokenizer with vocabulary size 1,000 applied to complex text, resulting in many tiny tokens.
Correct approach: Choose a balanced vocabulary size (e.g., 30,000) to reduce token fragmentation while controlling model size.
Root cause: Misunderstanding the trade-off between vocabulary size and token sequence length.
#2 Ignoring unknown tokens and letting them crash the model during inference.
Wrong approach: Feeding text with unknown words directly without replacing or handling them.
Correct approach: Replace unknown tokens with a special token before model input.
Root cause: Not accounting for vocabulary limits and special token handling.
#3 Assuming tokenization is language-agnostic and using the same tokenizer for all languages.
Wrong approach: Applying an English word-level tokenizer to Chinese text without adjustment.
Correct approach: Use language-specific tokenizers or subword methods that handle language morphology properly.
Root cause: Overlooking language differences in text structure and tokenization needs.
Key Takeaways
Tokenization breaks text into manageable pieces called tokens, which are the building blocks for language models.
Vocabulary is the set of tokens a model knows, mapping each token to a unique ID for processing.
Different tokenization methods like word, subword, and character-level affect vocabulary size and model flexibility.
Handling unknown tokens with special tokens allows models to process new or rare words without failure.
Choosing the right vocabulary size balances model performance, size, and ability to generalize to new language.