
Tokenization and vocabulary in Prompt Engineering / GenAI - Full Explanation

Introduction
When computers read text, they need a way to break it into smaller pieces they can work with. Tokenization and vocabulary solve this together: tokenization splits the text into manageable parts, and the vocabulary defines which parts the computer recognizes.
Explanation
Tokenization
Tokenization is the process of breaking text into smaller units called tokens. These tokens can be words, parts of words, or even characters depending on the method used. This helps the computer handle text piece by piece instead of as one long string.
Tokenization splits text into smaller, meaningful pieces called tokens.
Types of Tokens
Tokens can be whole words, subwords, or characters. Word tokens treat each word as a unit, while subword tokens break words into smaller parts to handle unknown or rare words better. Character tokens split text into single letters or symbols.
Tokens vary from full words to smaller parts like subwords or characters.
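The three granularities described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular model tokenizes: the function names and the `PIECES` list are hypothetical, and the subword split is a simple greedy longest-match, assumed here only to show the idea.

```python
import re

def word_tokens(text):
    # Words and punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokens(text):
    # Every character, including spaces, is its own token.
    return list(text)

def subword_tokens(word, pieces):
    # Greedy longest-match split of one word against a known piece list.
    # Falls back to a single character when no longer piece matches.
    tokens, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in pieces or end == start + 1:
                tokens.append(word[start:end])
                start = end
                break
    return tokens

# Hypothetical piece list, chosen only for this example.
PIECES = {"token", "ization", "un", "break", "able"}

print(word_tokens("Tokenization helps!"))      # ['Tokenization', 'helps', '!']
print(char_tokens("hi!"))                      # ['h', 'i', '!']
print(subword_tokens("tokenization", PIECES))  # ['token', 'ization']
print(subword_tokens("unbreakable", PIECES))   # ['un', 'break', 'able']
```

The subword version never fails on an unseen word, because any word can be spelled out character by character in the worst case.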
Vocabulary
Vocabulary is the set of all tokens that a model knows and can use. It acts like a dictionary for the computer, listing every token it can recognize. A good vocabulary covers common text well while staying small enough to be efficient, so its design is a trade-off between size and coverage.
Vocabulary is the list of all tokens a model understands and uses.
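As a rough sketch of the size-versus-coverage trade-off, the snippet below builds a word-level vocabulary from a toy corpus and measures how much of that corpus it covers. The names `build_vocab`, `coverage`, and `CORPUS` are made up for this example, and keeping the most frequent tokens is only one simple way to choose a vocabulary.

```python
from collections import Counter

def build_vocab(corpus, max_size):
    # Count whitespace-separated tokens and keep the most frequent ones.
    counts = Counter(token for line in corpus for token in line.split())
    most_common = [token for token, _ in counts.most_common(max_size)]
    # IDs start at 0 and follow frequency order.
    return {token: idx for idx, token in enumerate(most_common)}

def coverage(corpus, vocab):
    # Fraction of token occurrences in the corpus found in the vocabulary.
    tokens = [token for line in corpus for token in line.split()]
    return sum(token in vocab for token in tokens) / len(tokens)

# Toy data, assumed for illustration.
CORPUS = ["the cat sat", "the dog sat", "the bird flew"]

small = build_vocab(CORPUS, max_size=2)
full = build_vocab(CORPUS, max_size=10)

print(small)                    # {'the': 0, 'sat': 1}
print(coverage(CORPUS, small))  # 5 of the 9 token occurrences are covered
print(len(full))                # 6
print(coverage(CORPUS, full))   # 1.0
```

A two-token vocabulary already covers more than half of this corpus, while full coverage needs three times as many entries.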
Why Tokenization and Vocabulary Matter
These two work together to let computers read and generate text. Tokenization breaks text down, and vocabulary tells the computer what pieces it can work with. This affects how well a model understands language and handles new or complex words.
Tokenization and vocabulary together enable effective text understanding and generation.
Real World Analogy

Imagine reading a book in a language you are learning. You break sentences into words or parts you recognize, like familiar phrases or letters. Your vocabulary is the list of words you know, helping you understand and use the language better.

Tokenization → Breaking sentences into words or smaller parts you recognize
Types of Tokens → Recognizing whole words, parts of words, or letters depending on your skill
Vocabulary → The list of words and phrases you know in the language
Why Tokenization and Vocabulary Matter → How breaking down text and knowing words helps you understand and speak better
Diagram
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│   Text      │─────▶│ Tokenization│─────▶│  Tokens     │
└─────────────┘      └─────────────┘      └─────────────┘
                                   │
                                   ▼
                            ┌─────────────┐
                            │ Vocabulary  │
                            └─────────────┘
This diagram shows text being broken into tokens by tokenization, which are then matched to a vocabulary.
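The flow in the diagram can be written out as a small round trip: text is split into tokens, each token is looked up in the vocabulary to get an ID, and IDs can be mapped back to text. The vocabulary and function names here are assumptions for the sketch, and tokenization is plain whitespace splitting.

```python
# Toy vocabulary, assumed for illustration.
VOCAB = {"the": 0, "cat": 1, "sat": 2}
# Inverse mapping, used to turn IDs back into tokens.
ID_TO_TOKEN = {idx: token for token, idx in VOCAB.items()}

def tokenize(text):
    # Text -> tokens (whitespace splitting only).
    return text.split()

def encode(text):
    # Tokens -> IDs. Raises KeyError for a token missing from VOCAB.
    return [VOCAB[token] for token in tokenize(text)]

def decode(ids):
    # IDs -> text, joining tokens with single spaces.
    return " ".join(ID_TO_TOKEN[i] for i in ids)

ids = encode("the cat sat")
print(ids)          # [0, 1, 2]
print(decode(ids))  # the cat sat
```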
Key Facts
Token: A small piece of text such as a word, subword, or character used in processing language.
Tokenization: The process of splitting text into tokens for easier analysis by computers.
Vocabulary: The complete set of tokens that a language model can recognize and use.
Subword Token: A token that represents part of a word to handle rare or new words better.
Word Token: A token that corresponds to a whole word in the text.
Common Confusions
Misconception: Tokenization always splits text into words only.
Correction: Tokenization can split text into words, subwords, or characters, depending on the method used.
Misconception: Vocabulary contains all possible words in a language.
Correction: A vocabulary contains only the tokens chosen when the model's tokenizer was built, so it may not cover every word in the language.
Summary
Tokenization breaks text into smaller pieces called tokens to help computers process language.
Tokens can be whole words, parts of words, or characters depending on the approach.
Vocabulary is the set of tokens a model knows and uses to understand and generate text.

Practice

1. What does tokenization do in natural language processing?
easy
A. Converts tokens into images
B. Breaks text into smaller pieces called tokens
C. Removes all punctuation from text
D. Combines multiple texts into one

Solution

  1. Step 1: Understand the role of tokenization

    Tokenization splits text into smaller parts called tokens, like words or subwords.
  2. Step 2: Compare options with tokenization definition

    Only option B, "Breaks text into smaller pieces called tokens", matches this definition.
  3. Final Answer:

    Breaks text into smaller pieces called tokens -> Option B
  4. Quick Check:

    Tokenization = splitting text [OK]
Hint: Tokenization means splitting text into pieces [OK]
Common Mistakes:
  • Thinking tokenization changes text to images
  • Confusing tokenization with removing punctuation
  • Believing tokenization merges texts
2. Which of the following is the correct way to represent a token ID in Python?
easy
A. token_id = 'word'
B. token_id = {word: 1}
C. token_id = [word]
D. token_id = 123

Solution

  1. Step 1: Understand token ID representation

    Token IDs are numbers representing tokens, so they should be integers.
  2. Step 2: Check each option's type

    token_id = 123 assigns the integer 123, which is correct. The other options assign a string, a dictionary, or a list, none of which is an integer ID.
  3. Final Answer:

    token_id = 123 -> Option D
  4. Quick Check:

    Token ID = number [OK]
Hint: Token IDs are numbers, not words or lists [OK]
Common Mistakes:
  • Using strings instead of numbers for token IDs
  • Confusing token IDs with token text
  • Using lists or dictionaries wrongly
3. Given the vocabulary {'hello': 1, 'world': 2, '!': 3}, what is the token ID list for the text 'hello world!'?
medium
A. [1, 2, 3]
B. [0, 1, 2]
C. ['hello', 'world', '!']
D. [3, 2, 1]

Solution

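  1. Step 1: Split the text and map each token to its ID

    The text splits into the tokens 'hello', 'world', and '!' (the punctuation mark is its own token). According to the vocabulary, 'hello' maps to 1, 'world' maps to 2, and '!' maps to 3.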
  2. Step 2: Create the token ID list in order

    The text 'hello world!' becomes [1, 2, 3].
  3. Final Answer:

    [1, 2, 3] -> Option A
  4. Quick Check:

    Text tokens = [1, 2, 3] [OK]
Hint: Match words to IDs in order [OK]
Common Mistakes:
  • Mixing up token order
  • Using token text instead of IDs
  • Assigning wrong IDs from vocabulary
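A quick way to check the answer to question 3 is to run the lookup. The sketch below assumes a regular expression that separates punctuation from words; a plain split() would leave 'world!' as one piece, which is not in the vocabulary.

```python
import re

vocab = {'hello': 1, 'world': 2, '!': 3}
text = 'hello world!'

# Whitespace splitting keeps the punctuation attached to the word.
print(text.split())  # ['hello', 'world!']

# Splitting words and punctuation separately matches the vocabulary entries.
tokens = re.findall(r"\w+|[^\w\s]", text)
token_ids = [vocab[token] for token in tokens]

print(tokens)     # ['hello', 'world', '!']
print(token_ids)  # [1, 2, 3]
```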
4. What is wrong with this tokenization code snippet?
vocab = {'hi': 1, 'there': 2}
text = 'hi there'
tokens = [vocab[word] for word in text.split() if word in vocab]
medium
A. It will raise a KeyError if a word is missing
B. It correctly tokenizes the text
C. It ignores words not in vocabulary
D. It uses split() incorrectly on the text

Solution

  1. Step 1: Analyze the list comprehension

    The code splits text and includes only words found in vocab, skipping others.
  2. Step 2: Identify behavior on unknown words

    Words not in vocab are ignored, which may lose information.
  3. Final Answer:

    It ignores words not in vocabulary -> Option C
  4. Quick Check:

    Unknown words skipped = ignoring [OK]
Hint: Check if unknown words are skipped or cause errors [OK]
Common Mistakes:
  • Assuming a KeyError will be raised even though the 'if' check prevents it
  • Thinking split() is wrong here
  • Missing that unknown words are ignored silently
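To make the behavior in question 4 visible, the sketch below runs the same list comprehension on a sentence with one unknown word, then shows one possible alternative that keeps a placeholder instead of dropping the word. `UNK_ID` and both function names are hypothetical; reserving ID 0 for unknown words is an assumption made for this example.

```python
vocab = {'hi': 1, 'there': 2}
UNK_ID = 0  # hypothetical ID reserved for words missing from vocab

def encode_skip(text):
    # Same logic as the quiz snippet: unknown words are silently dropped.
    return [vocab[word] for word in text.split() if word in vocab]

def encode_unk(text):
    # Alternative: unknown words are kept as UNK_ID, so length is preserved.
    return [vocab.get(word, UNK_ID) for word in text.split()]

print(encode_skip('hi there friend'))  # [1, 2]
print(encode_unk('hi there friend'))   # [1, 2, 0]
```

With the placeholder, the output has one ID per input word, so it is at least clear that something was not recognized.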
5. You have a vocabulary with tokens: {'I':1, 'love':2, 'AI':3, '.':4}. How would you tokenize the sentence 'I love AI!' considering the exclamation mark is not in the vocabulary?
hard
A. Add '!' to vocabulary with new ID and tokenize as [1, 2, 3, 5]
B. Replace '!' with '.' and tokenize as [1, 2, 3, 4]
C. Ignore '!' and tokenize as [1, 2, 3]
D. Raise an error because '!' is unknown

Solution

  1. Step 1: Understand vocabulary coverage

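    The vocabulary lacks '!', so the sentence cannot be fully encoded unless '!' is given an ID. This question assumes the vocabulary can still be extended; a model whose vocabulary is already fixed would typically fall back to an unknown-token ID or smaller pieces instead.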
  2. Step 2: Add '!' with a new token ID

    Assign '!' a new ID (e.g., 5) and tokenize the sentence as [1, 2, 3, 5].
  3. Final Answer:

    Add '!' to vocabulary with new ID and tokenize as [1, 2, 3, 5] -> Option A
  4. Quick Check:

    Unknown token added = new ID [OK]
Hint: Add unknown tokens to vocabulary before tokenizing [OK]
Common Mistakes:
  • Ignoring unknown tokens silently
  • Replacing unknown tokens incorrectly
  • Assuming error without handling unknown tokens
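The approach in question 5 can be sketched as follows. The helper `add_token` is hypothetical, and giving a new token the next unused integer is an assumption of this example. It works for a toy dictionary like this one; in a trained model each ID is tied to learned parameters, so the vocabulary is normally not extended this casually.

```python
import re

def add_token(vocab, token):
    # Give an unseen token the next unused ID; known tokens keep their ID.
    if token not in vocab:
        vocab[token] = max(vocab.values(), default=0) + 1
    return vocab[token]

vocab = {'I': 1, 'love': 2, 'AI': 3, '.': 4}

# Separate words from punctuation so '!' becomes its own token.
tokens = re.findall(r"\w+|[^\w\s]", 'I love AI!')
token_ids = [add_token(vocab, token) for token in tokens]

print(tokens)      # ['I', 'love', 'AI', '!']
print(token_ids)   # [1, 2, 3, 5]
print(vocab['!'])  # 5
```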