Recall & Review

beginner

What is tokenization in natural language processing?

Tokenization is the process of breaking text into smaller pieces called tokens, such as words or subwords, to help machines understand and work with language.
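To make the idea concrete, here is a minimal sketch of word-level tokenization. The function name simple_tokenize and the regular expression are illustrative assumptions, not the method of any particular library; real tokenizers are usually more elaborate.

```python
import re

def simple_tokenize(text):
    # Hypothetical helper: a token is either a run of word characters
    # or a single punctuation mark; whitespace is discarded.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Hello, world!")
print(tokens)  # ['Hello', ',', 'world', '!']
```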

beginner

Why do we need a vocabulary in language models?

A vocabulary is a list of all tokens the model knows. It helps the model convert text into numbers it can process and generate text by choosing tokens from this list.
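The two directions described above (text to numbers, numbers back to text) can be sketched with a plain dictionary. The tiny vocabulary and the names encode and decode are assumptions made for illustration only.

```python
# Toy vocabulary (assumed for this example): token -> integer ID.
vocab = {"i": 0, "love": 1, "cats": 2, "dogs": 3}

# Reverse lookup used when turning IDs back into tokens.
id_to_token = {token_id: token for token, token_id in vocab.items()}

def encode(tokens):
    # Convert each token to its ID so a model can process it.
    return [vocab[t] for t in tokens]

def decode(ids):
    # Convert IDs (for example, ones chosen by a model) back to tokens.
    return [id_to_token[i] for i in ids]

ids = encode(["i", "love", "cats"])
print(ids)          # [0, 1, 2]
print(decode(ids))  # ['i', 'love', 'cats']
```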

intermediate

What is the difference between word-level and subword-level tokenization?

Word-level tokenization splits text into whole words, while subword-level breaks words into smaller parts. Subword tokenization helps handle rare or new words better.
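As a rough illustration of the difference, the sketch below splits a word into the longest known pieces, working left to right. This greedy longest-match scheme and the toy vocabulary are assumptions for the example; they are a simplification and not the exact algorithm of any specific subword tokenizer.

```python
def subword_tokenize(word, vocab):
    # Hypothetical greedy longest-match splitter.
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts at this position.
            return ["<unk>"]
        pieces.append(word[start:end])
        start = end
    return pieces

# Toy subword vocabulary (assumed for this example).
toy_vocab = {"token", "ization", "un", "break", "able"}

print("unbreakable tokenization".split())            # word-level: 2 whole words
print(subword_tokenize("tokenization", toy_vocab))   # ['token', 'ization']
print(subword_tokenize("unbreakable", toy_vocab))    # ['un', 'break', 'able']
```

Neither "tokenization" nor "unbreakable" is in the toy vocabulary as a whole word, yet both can still be represented from known pieces.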

intermediate

How does tokenization affect the size of the vocabulary?

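Finer-grained tokenization, such as subword tokenization, usually needs a smaller vocabulary than word-level tokenization because a limited set of word pieces can be recombined to form many words. Word-level tokenization needs a separate entry for every distinct word it should cover.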
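A small worked example can show the effect on vocabulary size. The corpus, the hand-picked pieces, and the helper can_segment are all assumptions chosen for illustration; real subword vocabularies are learned from data rather than written by hand.

```python
corpus = "play plays played playing player replay replays replayed"

# Word-level: every distinct word needs its own vocabulary entry.
word_vocab = set(corpus.split())

# Subword-level: a few reusable pieces (hand-picked for this toy corpus).
pieces = ["re", "play", "s", "ed", "ing", "er"]

def can_segment(word, pieces):
    # True if the word can be written as a sequence of known pieces.
    if word == "":
        return True
    return any(
        word.startswith(p) and can_segment(word[len(p):], pieces)
        for p in pieces
    )

print(len(word_vocab))  # 8 entries at word level
print(len(pieces))      # 6 entries at subword level
print(all(can_segment(w, pieces) for w in word_vocab))  # True
```

On a realistic corpus with many inflected forms, the gap between the two counts is typically much wider than in this toy case.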

beginner

What happens if a token is not in the vocabulary during model use?

If a token is not in the vocabulary, the tokenizer usually either replaces it with a special unknown token (often written <unk> or [UNK]) or breaks it into smaller known tokens so the model can still process it.
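The unknown-token fallback can be sketched as a dictionary lookup with a default. The vocabulary, the "<unk>" spelling, and the name encode_with_unk are assumptions for this example.

```python
UNK = "<unk>"

# Toy vocabulary (assumed); ID 0 is reserved for the unknown token.
vocab = {UNK: 0, "the": 1, "cat": 2, "sat": 3}

def encode_with_unk(tokens):
    # Tokens missing from the vocabulary fall back to the unknown-token ID.
    return [vocab.get(t, vocab[UNK]) for t in tokens]

print(encode_with_unk(["the", "cat", "purred"]))  # [1, 2, 0]
```

The trade-off is that every unknown word collapses to the same ID, so the original word cannot be recovered from the encoded sequence.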

What is the main goal of tokenization?

A. To remove punctuation from text

B. To translate text into another language

C. To compress text into fewer characters

D. To split text into smaller pieces called tokens

Which tokenization method helps handle new or rare words better?

A. Subword-level tokenization

B. Word-level tokenization

C. Character-level tokenization

D. Sentence-level tokenization

What does a vocabulary in a language model contain?

A. All possible sentences

B. All tokens the model can recognize

C. All grammar rules

D. All training data

What is a common way models handle tokens not in their vocabulary?

A. Ignore the token completely

B. Add the token to the vocabulary instantly

C. Replace with an unknown token or split into smaller tokens

D. Translate the token to another language

Which tokenization approach usually results in the largest vocabulary?

A. Word-level tokenization

B. Subword-level tokenization

C. Character-level tokenization

D. Sentence-level tokenization

Explain in your own words what tokenization is and why it is important for language models.

Describe the role of vocabulary in a language model and what happens when the model encounters a token not in its vocabulary.

Practice

1. What does tokenization do in natural language processing?

easy

A. Converts tokens into images

B. Breaks text into smaller pieces called tokens

C. Removes all punctuation from text

D. Combines multiple texts into one

Solution

Step 1: Understand the role of tokenization

Step 2: Compare options with tokenization definition

Final Answer: B. Breaks text into smaller pieces called tokens

Quick Check:

Solution

Step 1: Understand token ID representation

Step 2: Check each option's type

Final Answer:

Quick Check:

Solution

Step 1: Map each word to its token ID

Step 2: Create the token ID list in order

Final Answer:

Quick Check:

Solution

Step 1: Analyze the list comprehension

Step 2: Identify behavior on unknown words

Final Answer:

Quick Check:

Solution

Step 1: Understand vocabulary coverage

Step 2: Add '!' with a new token ID

Final Answer:

Quick Check: