Discover how breaking words into tiny pieces unlocks the magic of language understanding for AI!
Why Tokenization and vocabulary in Prompt Engineering / GenAI? - Purpose & Use Cases
Imagine trying to teach a computer to understand a whole book by reading it letter by letter without any breaks or clues.
You have to manually split sentences into words and guess meanings without any help.
Doing this by hand is slow and confusing.
It's easy to make mistakes splitting words or missing important parts.
Without a clear list of known words, the computer gets lost and can't learn well.
Tokenization breaks text into small, meaningful pieces automatically.
Vocabulary is the list of these pieces the computer knows.
Together, they help the computer read and understand language clearly and quickly.
text = 'Hello world'
words = []
for char in text:
    # manually guess word boundaries
    pass

tokens = tokenizer.tokenize('Hello world')
vocab = tokenizer.get_vocab()

It lets machines quickly and accurately turn language into pieces they can learn from and use.
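The `tokenizer` object used above is never defined in the text. As a minimal sketch, assuming a hypothetical `ToyTokenizer` class (an illustration, not a real library API) that works at the word level, the two calls might behave like this:

```python
import re


class ToyTokenizer:
    """Hypothetical word-level tokenizer, for illustration only."""

    def __init__(self, corpus):
        # Build the vocabulary from every distinct token in the corpus,
        # assigning IDs in order of first appearance.
        self._vocab = {}
        for sentence in corpus:
            for token in self.tokenize(sentence):
                self._vocab.setdefault(token, len(self._vocab))

    def tokenize(self, text):
        # Each word is one token; each punctuation mark is its own token.
        return re.findall(r"\w+|[^\w\s]", text)

    def get_vocab(self):
        # Return a copy so callers cannot mutate the internal mapping.
        return dict(self._vocab)


# The two-sentence corpus is an assumption made for this example.
tokenizer = ToyTokenizer(["Hello world", "Hello again!"])
tokens = tokenizer.tokenize("Hello world")
vocab = tokenizer.get_vocab()
print(tokens)  # ['Hello', 'world']
print(vocab)   # {'Hello': 0, 'world': 1, 'again': 2, '!': 3}
```

Production tokenizers usually split text into subword pieces rather than whole words, so their output for the same sentence may look different.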
When you talk to a voice assistant, tokenization helps it understand your words and respond correctly.
Tokenization splits text into manageable parts automatically.
Vocabulary is the known list of these parts for the machine.
Together, they make language easy for machines to process and learn.
Practice
Solution
Step 1: Understand the role of tokenization
Tokenization splits text into smaller parts called tokens, like words or subwords.
Step 2: Compare options with tokenization definition
Only "Breaks text into smaller pieces called tokens" correctly describes breaking text into tokens.
Final Answer:
Breaks text into smaller pieces called tokens -> Option B
Quick Check:
Tokenization = splitting text [OK]
- Thinking tokenization changes text to images
- Confusing tokenization with removing punctuation
- Believing tokenization merges texts
Solution
Step 1: Understand token ID representation
Token IDs are numbers representing tokens, so they should be integers.
Step 2: Check each option's type
token_id = 123 assigns an integer 123, which is correct. Others use strings, lists, or dictionaries incorrectly.
Final Answer:
token_id = 123 -> Option D
Quick Check:
Token ID = number [OK]
- Using strings instead of numbers for token IDs
- Confusing token IDs with token text
- Using lists or dictionaries wrongly
'hello world!'?
Solution
Step 1: Map each word to its token ID
'hello' maps to 1, 'world' maps to 2, and '!' maps to 3 according to the vocabulary.
Step 2: Create the token ID list in order
The text 'hello world!' becomes [1, 2, 3].
Final Answer:
[1, 2, 3] -> Option A
Quick Check:
Text tokens = [1, 2, 3] [OK]
- Mixing up token order
- Using token text instead of IDs
- Assigning wrong IDs from vocabulary
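The mapping described in this solution can be sketched in a few lines. The `encode` helper and its regular expression are assumptions made for illustration; the vocabulary is the one stated in Step 1.

```python
import re

# Vocabulary as described in the solution: token text -> token ID.
vocab = {"hello": 1, "world": 2, "!": 3}


def encode(text, vocab):
    # Split into word tokens and single punctuation tokens, keeping order.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Look up each token's ID; raises KeyError if a token is not in vocab.
    return [vocab[token] for token in tokens]


token_ids = encode("hello world!", vocab)
print(token_ids)  # [1, 2, 3]
```

The order of the IDs follows the order of the tokens in the text, which is why swapping tokens changes the result.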
vocab = {'hi': 1, 'there': 2}
text = 'hi there'
tokens = [vocab[word] for word in text.split() if word in vocab]
Solution
Step 1: Analyze the list comprehension
The code splits text and includes only words found in vocab, skipping others.
Step 2: Identify behavior on unknown words
Words not in vocab are ignored, which may lose information.
Final Answer:
It ignores words not in vocabulary -> Option C
Quick Check:
Unknown words skipped = ignoring [OK]
- Assuming a KeyError will occur, even though the 'if' check prevents it
- Thinking split() is wrong here
- Missing that unknown words are ignored silently
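To make the silent loss visible, the sketch below compares the skipping behaviour from the question with one common alternative: mapping unknown words to a reserved "unknown" ID. The `UNK_ID` value of 0, the helper names, and the extra word 'friend' are assumptions for illustration and do not come from the question.

```python
vocab = {"hi": 1, "there": 2}
UNK_ID = 0  # hypothetical ID reserved for unknown words (assumption)


def encode_skip(text, vocab):
    # Same logic as the question: words missing from vocab are dropped.
    return [vocab[word] for word in text.split() if word in vocab]


def encode_with_unk(text, vocab, unk_id=UNK_ID):
    # Alternative: keep a placeholder ID so sequence length is preserved.
    return [vocab.get(word, unk_id) for word in text.split()]


text = "hi there friend"
skipped = encode_skip(text, vocab)
with_unk = encode_with_unk(text, vocab)
print(skipped)   # [1, 2]
print(with_unk)  # [1, 2, 0]
```

With the placeholder, the output still has one ID per input word, so downstream code can tell that something was present but not recognized.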
'I love AI!' considering the exclamation mark is not in the vocabulary?
Solution
Step 1: Understand vocabulary coverage
The vocabulary lacks '!', so it must be added to handle the sentence fully.
Step 2: Add '!' with a new token ID
Assign '!' a new ID (e.g., 5) and tokenize the sentence as [1, 2, 3, 5].
Final Answer:
Add '!' to vocabulary with new ID and tokenize as [1, 2, 3, 5] -> Option A
Quick Check:
Unknown token added = new ID [OK]
- Ignoring unknown tokens silently
- Replacing unknown tokens incorrectly
- Assuming error without handling unknown tokens
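The approach in this solution, extending the vocabulary before encoding, can be sketched as follows. The question's original vocabulary is not shown here, so the sketch assumes a hypothetical four-entry vocabulary in which ID 4 is already taken (by an invented entry, 'you'), making 5 the next free ID. The `add_token` and `encode` helpers are also illustrative assumptions.

```python
import re

# Hypothetical starting vocabulary (assumption): ID 4 is already in use,
# so the next free ID is 5, matching the answer above.
vocab = {"I": 1, "love": 2, "AI": 3, "you": 4}


def add_token(vocab, token):
    # Give an unseen token the next unused ID; leave known tokens alone.
    if token not in vocab:
        vocab[token] = max(vocab.values(), default=0) + 1
    return vocab[token]


def encode(text, vocab):
    # Split into word tokens and single punctuation tokens, keeping order.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [vocab[token] for token in tokens]


new_id = add_token(vocab, "!")
token_ids = encode("I love AI!", vocab)
print(new_id)     # 5
print(token_ids)  # [1, 2, 3, 5]
```

Calling `add_token` again with '!' returns the same ID rather than creating a duplicate entry.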
