Practice

1. What does tokenization do in natural language processing?

easy

A. Converts tokens into images

B. Breaks text into smaller pieces called tokens

C. Removes all punctuation from text

D. Combines multiple texts into one

Solution

Step 1: Understand the role of tokenization
Tokenization splits text into smaller parts called tokens, like words or subwords.
Step 2: Compare options with tokenization definition
Only option B, "Breaks text into smaller pieces called tokens", matches this definition.
Final Answer:
Breaks text into smaller pieces called tokens -> Option B
Quick Check:
Tokenization = splitting text [OK]

Hint: Tokenization means splitting text into pieces [OK]

Common Mistakes:

Thinking tokenization changes text to images
Confusing tokenization with removing punctuation
Believing tokenization merges texts
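
As a minimal sketch of the idea in question 1, the snippet below splits a sentence into word and punctuation tokens with a regular expression. The function name `simple_tokenize` and the splitting rule are illustrative assumptions, not something the quiz prescribes; real tokenizers often use subword schemes instead.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens (illustrative rule)."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Tokenization splits text, like this!")
print(tokens)  # ['Tokenization', 'splits', 'text', ',', 'like', 'this', '!']
```

Note that the punctuation is kept as separate tokens rather than removed, which is why option C does not describe tokenization.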

2. Which of the following is the correct way to represent a token ID in Python?

easy

A. token_id = 'word'

B. token_id = {word: 1}

C. token_id = [word]

D. token_id = 123

Solution

Step 1: Understand token ID representation
Token IDs are numbers representing tokens, so they should be integers.
Step 2: Check each option's type
Option D, token_id = 123, assigns the integer 123, which is correct. The other options assign a string, a dictionary, or a list rather than a single integer.
Final Answer:
token_id = 123 -> Option D
Quick Check:
Token ID = number [OK]

Hint: Token IDs are numbers, not words or lists [OK]

Common Mistakes:

Using strings instead of numbers for token IDs
Confusing token IDs with token text
Using lists or dictionaries wrongly
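
To make the distinction between token text and token ID concrete, here is a small sketch. The vocabulary contents and the name `id_to_token` are made up for illustration.

```python
# Hypothetical vocabulary: token text -> integer token ID
vocab = {"hello": 1, "world": 2}

# Reverse mapping: integer token ID -> token text
id_to_token = {token_id: token for token, token_id in vocab.items()}

token_id = vocab["hello"]
print(token_id, type(token_id).__name__)  # 1 int
print(id_to_token[token_id])              # hello
```

A dictionary is a reasonable container for the whole vocabulary, but each individual token ID stored in it is still a plain integer.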

3. Given the vocabulary {'hello': 1, 'world': 2, '!': 3}, what is the token ID list for the text 'hello world!'?

medium

A. [1, 2, 3]

B. [0, 1, 2]

C. ['hello', 'world', '!']

D. [3, 2, 1]

Solution

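Step 1: Map each token to its token ID
The text splits into the tokens 'hello', 'world', and '!' (the punctuation mark is its own token). According to the vocabulary, 'hello' maps to 1, 'world' maps to 2, and '!' maps to 3.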
Step 2: Create the token ID list in order
The text 'hello world!' becomes [1, 2, 3].
Final Answer:
[1, 2, 3] -> Option A
Quick Check:
Text tokens = [1, 2, 3] [OK]

Hint: Match words to IDs in order [OK]

Common Mistakes:

Mixing up token order
Using token text instead of IDs
Assigning wrong IDs from vocabulary
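
The lookup in question 3 can be sketched as follows. The question does not say how the text is split, so the regular expression here (words and single punctuation marks as separate tokens) is an assumption, and `encode` is a hypothetical helper name.

```python
import re

vocab = {"hello": 1, "world": 2, "!": 3}

def encode(text, vocab):
    """Tokenize text, then look up each token's ID in order."""
    # Assumed splitting rule: words and single punctuation marks are separate tokens
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Raises KeyError if a token is missing from the vocabulary
    return [vocab[token] for token in tokens]

print(encode("hello world!", vocab))  # [1, 2, 3]
```

A plain `text.split()` would produce 'world!' as a single token, which is not in this vocabulary, so the splitting rule matters for getting [1, 2, 3].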

4. What is wrong with this tokenization code snippet?

vocab = {'hi': 1, 'there': 2}
text = 'hi there'
tokens = [vocab[word] for word in text.split() if word in vocab]

medium

A. It will raise a KeyError if a word is missing

B. It correctly tokenizes the text

C. It ignores words not in vocabulary

D. It uses split() incorrectly on the text

Solution

Step 1: Analyze the list comprehension
The code splits text and includes only words found in vocab, skipping others.
Step 2: Identify behavior on unknown words
For 'hi there' the result is [1, 2], but any word missing from vocab is dropped silently, so the output can lose information without raising an error.
Final Answer:
It ignores words not in vocabulary -> Option C
Quick Check:
Unknown words skipped = ignoring [OK]

Hint: Check if unknown words are skipped or cause errors [OK]

Common Mistakes:

Assuming a KeyError will be raised, even though the 'if' check prevents it
Thinking split() is wrong here
Missing that unknown words are ignored silently
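
The silent-skip behavior is easier to see with an input that contains an unknown word. The sketch below reuses the snippet from the question with a modified text, and contrasts it with one possible alternative: mapping unknown words to a reserved ID. `UNK_ID = 0` is an assumed placeholder value, not part of the original snippet.

```python
vocab = {"hi": 1, "there": 2}
text = "hi there friend"  # 'friend' is not in the vocabulary

# Pattern from the question: unknown words vanish without any signal
filtered = [vocab[word] for word in text.split() if word in vocab]

# Alternative sketch: keep a placeholder ID so the position is not lost
UNK_ID = 0  # assumed reserved ID for unknown words
with_unk = [vocab.get(word, UNK_ID) for word in text.split()]

print(filtered)  # [1, 2]
print(with_unk)  # [1, 2, 0]
```

With the placeholder, the output length still matches the number of input words, so downstream code can tell that something was unrecognized.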

5. You have a vocabulary with tokens: {'I':1, 'love':2, 'AI':3, '.':4}. How would you tokenize the sentence 'I love AI!' considering the exclamation mark is not in the vocabulary?

hard

A. Add '!' to vocabulary with new ID and tokenize as [1, 2, 3, 5]

B. Replace '!' with '.' and tokenize as [1, 2, 3, 4]

C. Ignore '!' and tokenize as [1, 2, 3]

D. Raise an error because '!' is unknown

Solution

Step 1: Understand vocabulary coverage
The vocabulary lacks '!'. Of the listed options, only adding it keeps every token of the sentence without altering the text or failing.
Step 2: Add '!' with a new token ID
Assign '!' a new ID (e.g., 5) and tokenize the sentence as [1, 2, 3, 5].
Final Answer:
Add '!' to vocabulary with new ID and tokenize as [1, 2, 3, 5] -> Option A
Quick Check:
Unknown token added = new ID [OK]

Hint: Add unknown tokens to vocabulary before tokenizing [OK]

Common Mistakes:

Ignoring unknown tokens silently
Replacing unknown tokens incorrectly
Assuming error without handling unknown tokens
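
The approach chosen in question 5 can be sketched as below. `encode_and_grow` is a hypothetical helper, the splitting rule is an assumption, and "next free ID = largest existing ID + 1" is one simple way to pick the new ID.

```python
import re

def encode_and_grow(text, vocab):
    """Encode text, assigning the next free ID to any token not yet in vocab.

    Note: this mutates the vocab dictionary that is passed in.
    """
    ids = []
    # Assumed splitting rule: words and single punctuation marks are separate tokens
    for token in re.findall(r"\w+|[^\w\s]", text):
        if token not in vocab:
            vocab[token] = max(vocab.values(), default=0) + 1
        ids.append(vocab[token])
    return ids

vocab = {"I": 1, "love": 2, "AI": 3, ".": 4}
print(encode_and_grow("I love AI!", vocab))  # [1, 2, 3, 5]
print(vocab["!"])                            # 5
```

Growing the vocabulary like this only makes sense while the vocabulary is still being built; once a model has been trained against a fixed set of IDs, new IDs would have no learned meaning.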

Epoch	Loss ↓	Accuracy ↑	Observation
1	2.30	0.15	Model starts with high loss and low accuracy as it learns token patterns.
2	1.85	0.35	Loss decreases and accuracy improves as vocabulary mapping becomes clearer.
3	1.40	0.55	Model better understands token sequences, improving predictions.
4	1.10	0.70	Vocabulary usage is more accurate, loss continues to drop.
5	0.85	0.80	Model converges well on token patterns and vocabulary.

Tokenization and vocabulary in Prompt Engineering / GenAI - Model Pipeline Trace
