
Tokenization and vocabulary in Prompt Engineering / GenAI - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
Understanding Tokenization Types

Which of the following best describes subword tokenization in natural language processing?

A. Splitting text into individual characters only
B. Breaking text into whole words without splitting
C. Dividing text into smaller units than words, like syllables or parts of words
D. Ignoring spaces and treating the entire text as one token
💡 Hint

Think about how tokenization helps handle unknown or rare words by breaking them down.
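The hint can be made concrete with a toy greedy longest-match tokenizer. This is only a sketch: the subword vocabulary below is made up for illustration, whereas real tokenizers such as BPE or WordPiece learn their vocabularies from data.

```python
# Toy subword vocabulary, chosen by hand for this example.
subwords = {"token", "ization", "un", "break", "able"}

def subword_tokenize(word):
    """Split a word into the longest known subwords, left to right."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest possible match first.
        for j in range(len(word), i, -1):
            if word[i:j] in subwords:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No subword matches: fall back to a single character.
            pieces.append(word[i])
            i += 1
    return pieces

print(subword_tokenize("tokenization"))  # ['token', 'ization']
print(subword_tokenize("unbreakable"))   # ['un', 'break', 'able']
```

Even a word the tokenizer has never stored whole can still be represented, because its pieces are in the vocabulary.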

Predict Output
intermediate
Output of Tokenization Code

What is the output of the following Python code using a simple whitespace tokenizer?

text = "Machine learning is fun"
tokens = text.split()
print(tokens)
A. ['Machine_learning_is_fun']
B. ['Machine learning is fun']
C. ['M', 'a', 'c', 'h', 'i', 'n', 'e']
D. ['Machine', 'learning', 'is', 'fun']
💡 Hint

Remember what the split() method does by default.
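A minimal demonstration of the default behaviour of `split()`:

```python
# With no arguments, str.split() splits on any run of whitespace
# and drops leading/trailing whitespace entirely.
text = "Machine learning is fun"
print(text.split())              # ['Machine', 'learning', 'is', 'fun']

# Runs of spaces, tabs, and newlines collapse into one boundary:
print("  a \t b\nc  ".split())   # ['a', 'b', 'c']

# This differs from splitting on a literal space character:
print("a  b".split(" "))         # ['a', '', 'b']
```

The no-argument form is why whitespace tokenizers built on `split()` never produce empty tokens.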

Model Choice
advanced
Choosing Vocabulary Size for Tokenization

You want to train a language model on a large dataset with many rare words. Which vocabulary size is best to balance coverage and model size?

A. Moderate vocabulary size (e.g., 30,000 tokens) with subword tokenization
B. Very small vocabulary (e.g., 500 tokens) to reduce model size
C. Very large vocabulary (e.g., 100,000 tokens) to cover all words exactly
D. Vocabulary size does not affect model performance
💡 Hint

Think about how subword tokenization helps with rare words and model efficiency.
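The trade-off in the hint can be illustrated by comparing the two extremes. This is a toy example, not a trained tokenizer: character-level tokenization needs only a tiny vocabulary but yields long sequences, while word-level tokenization yields short sequences but must store every word it will ever see.

```python
# A sentence containing a rare word.
text = "electroencephalography is a rare word"

# Tiny vocabulary (individual characters) -> long sequence.
char_tokens = list(text)
print(len(char_tokens))   # 37

# Word-level vocabulary -> short sequence, but 'electroencephalography'
# must appear in the vocabulary or it becomes an unknown token.
word_tokens = text.split()
print(len(word_tokens))   # 5
```

A moderate subword vocabulary sits between these extremes: its pieces can be combined to cover rare words without storing every word whole.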

Metrics
advanced
Evaluating Tokenizer Coverage

You have a tokenizer vocabulary of 10,000 tokens. After tokenizing a test set of 1,000 words, 50 words are split into multiple tokens. What is the approximate tokenization coverage percentage?

A. 95%
B. 5%
C. 50%
D. 100%
💡 Hint

Coverage means how many words are represented as single tokens.
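The arithmetic behind the hint, written out as a quick sketch (coverage here is defined as the fraction of words represented as a single token):

```python
total_words = 1000
split_words = 50   # words broken into multiple tokens

# Words kept whole, as a fraction of all words.
coverage = (total_words - split_words) / total_words
print(f"{coverage:.0%}")   # 95%
```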

🔧 Debug
expert
Identifying Tokenization Bug in Code

What error does the following code raise when trying to tokenize text using a vocabulary dictionary?

vocab = {"hello": 1, "world": 2}
text = "hello unknown world"
tokens = [vocab[word] for word in text.split()]
print(tokens)
A. No error, output is [1, 0, 2]
B. KeyError because 'unknown' is not in vocab
C. TypeError because vocab is not iterable
D. SyntaxError due to missing colon
💡 Hint

Check what happens when a word is not found in the dictionary keys.
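To check the behaviour concretely, the snippet below reproduces the failure and then applies one common remedy; the `UNK_ID` fallback is an illustration, not part of the original code.

```python
vocab = {"hello": 1, "world": 2}
text = "hello unknown world"

# Plain indexing raises KeyError for out-of-vocabulary words.
try:
    tokens = [vocab[word] for word in text.split()]
except KeyError as e:
    print("KeyError:", e)   # KeyError: 'unknown'

# A common fix: map out-of-vocabulary words to a reserved <unk> id.
UNK_ID = 0
tokens = [vocab.get(word, UNK_ID) for word in text.split()]
print(tokens)               # [1, 0, 2]
```

`dict.get(key, default)` never raises, which is why real tokenizers reserve an id for unknown tokens rather than indexing the vocabulary directly.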