Which of the following best describes subword tokenization in natural language processing?
Think about how tokenization helps handle unknown or rare words by breaking them down.
Subword tokenization breaks words into smaller meaningful parts, allowing models to understand rare or new words by their components.
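To make this concrete, here is a minimal sketch of greedy longest-match subword splitting against a toy vocabulary. The vocabulary, the `<unk>` placeholder, and the example words are illustrative assumptions, not the behavior of any particular tokenizer library.

```python
# Minimal greedy longest-match subword splitter over a toy vocabulary.
# The vocabulary and words below are illustrative only.
def subword_tokenize(word, vocab):
    tokens = []
    start = 0
    while start < len(word):
        # Find the longest vocabulary entry matching at this position.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No match at all: emit a placeholder token for this character.
            tokens.append("<unk>")
            start += 1
    return tokens

vocab = {"token", "ization", "un", "related"}
print(subword_tokenize("tokenization", vocab))  # ['token', 'ization']
print(subword_tokenize("unrelated", vocab))     # ['un', 'related']
```

Even though "unrelated" may never have appeared in training data, the model can still represent it from the familiar pieces "un" and "related".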
What is the output of the following Python code using a simple whitespace tokenizer?
text = "Machine learning is fun"
tokens = text.split()
print(tokens)
Remember what the split() method does by default.
The split() method without arguments splits the string on runs of whitespace, producing a list of words.
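Running the snippet confirms the answer:

```python
text = "Machine learning is fun"
tokens = text.split()  # splits on any run of whitespace by default
print(tokens)  # ['Machine', 'learning', 'is', 'fun']
```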
You want to train a language model on a large dataset with many rare words. Which vocabulary size is best to balance coverage and model size?
Think about how subword tokenization helps with rare words and model efficiency.
A moderate vocabulary with subword tokens balances coverage of rare words and keeps the model size manageable.
You have a tokenizer vocabulary of 10,000 tokens. After tokenizing a test set of 1,000 words, 50 words are split into multiple tokens. What is the approximate tokenization coverage percentage?
Coverage means how many words are represented as single tokens.
If 50 words are split, then 950 words are covered as single tokens. Coverage = (950/1000)*100 = 95%.
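The arithmetic above can be checked directly:

```python
total_words = 1000
split_words = 50  # words broken into multiple subword tokens
single_token_words = total_words - split_words

# Coverage = fraction of words represented as a single token, as a percentage.
coverage = single_token_words / total_words * 100
print(coverage)  # 95.0
```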
What error does the following code raise when trying to tokenize text using a vocabulary dictionary?
vocab = {"hello": 1, "world": 2}
text = "hello unknown world"
tokens = [vocab[word] for word in text.split()]
print(tokens)
Check what happens when a word is not found in the dictionary keys.
Accessing a dictionary with a key that does not exist raises a KeyError in Python.
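The snippet below demonstrates the KeyError and a common fix. Mapping unseen words to an `<unk>` id via dict.get is a standard pattern; the specific UNK_ID value of 0 here is an illustrative choice, not a fixed convention.

```python
vocab = {"hello": 1, "world": 2}
text = "hello unknown world"

try:
    tokens = [vocab[word] for word in text.split()]
except KeyError as e:
    # "unknown" is not a key in vocab, so indexing raises KeyError.
    print(f"KeyError: {e}")  # KeyError: 'unknown'

# A common fix: map out-of-vocabulary words to an <unk> id with dict.get.
UNK_ID = 0  # illustrative id for unknown words
tokens = [vocab.get(word, UNK_ID) for word in text.split()]
print(tokens)  # [1, 0, 2]
```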