
Tokenization and vocabulary in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - Tokenization and vocabulary
Problem: You want to prepare text data for a language model by splitting sentences into tokens and building a vocabulary. Currently, the tokenizer splits on spaces only, which causes issues with punctuation and unknown words.
Current Metrics: Tokenization accuracy: 75% (measured by correct token splits compared to a reference tokenizer). Vocabulary coverage: 60% (percentage of words in test text found in vocabulary).
Issue: The tokenizer is too simple, causing poor token splits and a small vocabulary that misses many words. This leads to poor model input quality.
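To make the issue concrete, here is a minimal sketch of a space-only baseline. The original baseline code is not shown, so the function name `whitespace_tokenize` and its exact behavior are assumptions.

```python
def whitespace_tokenize(text):
    # Assumed baseline: split on whitespace only, with no normalization
    return text.split()

tokens = whitespace_tokenize("Hello, world! This is a test.")
print(tokens)
# ['Hello,', 'world!', 'This', 'is', 'a', 'test.']
```

Punctuation stays attached to the neighboring word, so "test." and "test" would end up as two different vocabulary entries.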
Your Task
Improve tokenization accuracy to at least 90% and increase vocabulary coverage to at least 85%.
You must keep the tokenizer rule-based (no pretrained models).
You cannot use external libraries beyond Python standard libraries.
Solution
import re

class SimpleTokenizer:
    def __init__(self):
        self.vocab = set()

    def tokenize(self, text):
        # Normalize: lowercase and strip surrounding whitespace
        text = text.lower().strip()
        # Extract runs of word characters; spaces and punctuation are dropped
        tokens = re.findall(r"\b\w+\b", text)
        return tokens

    def build_vocab(self, texts):
        vocab = set()
        for text in texts:
            tokens = self.tokenize(text)
            vocab.update(tokens)
        self.vocab = vocab
        return vocab

# Example usage
texts = [
    "Hello, world! This is a test.",
    "Tokenization is important for NLP.",
    "Let's improve the tokenizer."
]

# Initialize tokenizer
tokenizer = SimpleTokenizer()

# Build vocabulary
vocab = tokenizer.build_vocab(texts)

# Test tokenization accuracy and vocab coverage
reference_tokens = [
    ['hello', 'world', 'this', 'is', 'a', 'test'],
    ['tokenization', 'is', 'important', 'for', 'nlp'],
    ['let', 's', 'improve', 'the', 'tokenizer']
]

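def tokenization_accuracy(tokenizer, texts, reference_tokens):
    # Position-wise match rate against the reference tokens, in percent
    correct = 0
    total = 0
    for text, ref in zip(texts, reference_tokens):
        pred = tokenizer.tokenize(text)
        # Count the longer sequence so extra or missing tokens lower the score
        total += max(len(pred), len(ref))
        for t1, t2 in zip(pred, ref):
            if t1 == t2:
                correct += 1
    # Guard against empty input to avoid dividing by zero
    return correct / total * 100 if total else 0.0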

def vocab_coverage(tokenizer, texts):
    total_words = 0
    known_words = 0
    for text in texts:
        tokens = tokenizer.tokenize(text)
        total_words += len(tokens)
        known_words += sum(1 for t in tokens if t in tokenizer.vocab)
    return known_words / total_words * 100

accuracy = tokenization_accuracy(tokenizer, texts, reference_tokens)
coverage = vocab_coverage(tokenizer, texts)

print(f"Tokenization accuracy: {accuracy:.1f}%")
print(f"Vocabulary coverage: {coverage:.1f}%")
Added regex-based tokenization to split on word boundaries, handling punctuation.
Lowercased text to normalize tokens.
Built vocabulary from all tokens in the dataset.
Measured tokenization accuracy against reference tokens.
Measured vocabulary coverage as percentage of known tokens.
Results Interpretation

Before: Tokenization accuracy: 75%, Vocabulary coverage: 60%

After: Tokenization accuracy: 100.0%, Vocabulary coverage: 100.0%

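Splitting on word boundaries with a simple regex and lowercasing the text improves tokenization quality and vocabulary coverage, both of which matter for language model input. Note that both metrics here are computed on the same three sentences used to build the vocabulary, and the reference tokens follow the same splitting convention, so 100% is the expected result rather than an estimate of performance on unseen text.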
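A held-out check gives a more realistic coverage number than measuring on the training sentences. The sketch below reuses the solution's tokenization rule as a plain function; the held-out sentence is a made-up example and not part of the original experiment.

```python
import re

def tokenize(text):
    # Same rule as the solution: lowercase, then keep runs of word characters
    return re.findall(r"\b\w+\b", text.lower().strip())

train_texts = [
    "Hello, world! This is a test.",
    "Tokenization is important for NLP.",
    "Let's improve the tokenizer.",
]
vocab = set()
for text in train_texts:
    vocab.update(tokenize(text))

# Hypothetical held-out sentence (an assumption for illustration)
held_out = "This tokenizer is a simple test for new words."
tokens = tokenize(held_out)
known = [t for t in tokens if t in vocab]
coverage = len(known) / len(tokens) * 100
print(f"Held-out coverage: {coverage:.1f}%")
# Held-out coverage: 66.7%
```

Here "simple", "new", and "words" never appeared in the training sentences, so only 6 of 9 tokens are covered.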
Bonus Experiment
Try adding a rule to split contractions like "let's" into "let" and "'s" to improve tokenization further.
💡 Hint
Use regex patterns to detect apostrophes inside words and split them accordingly.
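One possible way to follow the hint is sketched below. The pattern and the function name `tokenize_with_contractions` are assumptions; it keeps the solution's convention of lowercasing and dropping punctuation, and only treats an apostrophe as part of a token when it directly follows a word character.

```python
import re

# Either an apostrophe-led suffix that directly follows a word character
# (for example "'s"), or a plain run of word characters
TOKEN_PATTERN = re.compile(r"(?<=\w)'\w+|\w+")

def tokenize_with_contractions(text):
    return TOKEN_PATTERN.findall(text.lower().strip())

print(tokenize_with_contractions("Let's improve the tokenizer."))
# ['let', "'s", 'improve', 'the', 'tokenizer']
```

This simple rule splits "don't" into "don" and "'t"; a scheme that produces "do" and "n't" would need extra rules.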

Practice

1. What does tokenization do in natural language processing?
easy
A. Converts tokens into images
B. Breaks text into smaller pieces called tokens
C. Removes all punctuation from text
D. Combines multiple texts into one

Solution

  1. Step 1: Understand the role of tokenization

    Tokenization splits text into smaller parts called tokens, like words or subwords.
  2. Step 2: Compare options with tokenization definition

    Only option B, "Breaks text into smaller pieces called tokens", matches this definition.
  3. Final Answer:

    Breaks text into smaller pieces called tokens -> Option B
  4. Quick Check:

    Tokenization = splitting text [OK]
Hint: Tokenization means splitting text into pieces [OK]
Common Mistakes:
  • Thinking tokenization changes text to images
  • Confusing tokenization with removing punctuation
  • Believing tokenization merges texts
2. Which of the following is the correct way to represent a token ID in Python?
easy
A. token_id = 'word'
B. token_id = {word: 1}
C. token_id = [word]
D. token_id = 123

Solution

  1. Step 1: Understand token ID representation

    Token IDs are numbers representing tokens, so they should be integers.
  2. Step 2: Check each option's type

    token_id = 123 assigns the integer 123, which is correct. Option A is a string, option B is a dictionary, and option C is a list, none of which is a single numeric ID.
  3. Final Answer:

    token_id = 123 -> Option D
  4. Quick Check:

    Token ID = number [OK]
Hint: Token IDs are numbers, not words or lists [OK]
Common Mistakes:
  • Using strings instead of numbers for token IDs
  • Confusing token IDs with token text
  • Using lists or dictionaries wrongly
3. Given the vocabulary {'hello': 1, 'world': 2, '!': 3}, what is the token ID list for the text 'hello world!'?
medium
A. [1, 2, 3]
B. [0, 1, 2]
C. ['hello', 'world', '!']
D. [3, 2, 1]

Solution

  1. Step 1: Map each word to its token ID

    Once the text is split into the tokens 'hello', 'world', and '!' (punctuation is its own token), 'hello' maps to 1, 'world' maps to 2, and '!' maps to 3 according to the vocabulary.
  2. Step 2: Create the token ID list in order

    The text 'hello world!' becomes [1, 2, 3].
  3. Final Answer:

    [1, 2, 3] -> Option A
  4. Quick Check:

    Text tokens = [1, 2, 3] [OK]
Hint: Match words to IDs in order [OK]
Common Mistakes:
  • Mixing up token order
  • Using token text instead of IDs
  • Assigning wrong IDs from vocabulary
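The lookup in question 3 can be sketched as follows. The helper name `encode` and the regex are assumptions; the key point is that punctuation has to be separated from words before the lookup.

```python
import re

vocab = {'hello': 1, 'world': 2, '!': 3}

def encode(text, vocab):
    # Words and individual punctuation marks become separate tokens
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [vocab[token] for token in tokens]

print(encode('hello world!', vocab))
# [1, 2, 3]
```

A plain `'hello world!'.split()` would produce 'world!' as one token, which is not a key in this vocabulary.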
4. What is wrong with this tokenization code snippet?
vocab = {'hi': 1, 'there': 2}
text = 'hi there'
tokens = [vocab[word] for word in text.split() if word in vocab]
medium
A. It will raise a KeyError if a word is missing
B. It correctly tokenizes the text
C. It ignores words not in vocabulary
D. It uses split() incorrectly on the text

Solution

  1. Step 1: Analyze the list comprehension

    The code splits text and includes only words found in vocab, skipping others.
  2. Step 2: Identify behavior on unknown words

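    For 'hi there' every word is in the vocabulary, so the result is [1, 2]. Any word missing from vocab, however, would be dropped silently, which loses information.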
  3. Final Answer:

    It ignores words not in vocabulary -> Option C
  4. Quick Check:

    Unknown words skipped = ignoring [OK]
Hint: Check if unknown words are skipped or cause errors [OK]
Common Mistakes:
  • Assuming a KeyError will occur even though the 'if' check prevents it
  • Thinking split() is wrong here
  • Missing that unknown words are ignored silently
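The silent skipping in question 4 is easier to see with a word that is missing from the vocabulary. In this sketch the extra word 'friend' and the `UNK_ID` placeholder are assumptions added for illustration; they are not part of the quiz.

```python
vocab = {'hi': 1, 'there': 2}
text = 'hi there friend'  # 'friend' is not in the vocabulary

# Original snippet: unknown words vanish from the output
tokens = [vocab[word] for word in text.split() if word in vocab]
print(tokens)  # [1, 2]

# One possible alternative (an assumption): reserve an ID for unknown
# words so that the output keeps one ID per input word
UNK_ID = 0
tokens_with_unk = [vocab.get(word, UNK_ID) for word in text.split()]
print(tokens_with_unk)  # [1, 2, 0]
```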
5. You have a vocabulary with tokens: {'I':1, 'love':2, 'AI':3, '.':4}. How would you tokenize the sentence 'I love AI!' considering the exclamation mark is not in the vocabulary?
hard
A. Add '!' to vocabulary with new ID and tokenize as [1, 2, 3, 5]
B. Replace '!' with '.' and tokenize as [1, 2, 3, 4]
C. Ignore '!' and tokenize as [1, 2, 3]
D. Raise an error because '!' is unknown

Solution

  1. Step 1: Understand vocabulary coverage

    The vocabulary lacks '!', so it must be added to handle the sentence fully.
  2. Step 2: Add '!' with a new token ID

    Assign '!' a new ID (e.g., 5) and tokenize the sentence as [1, 2, 3, 5].
  3. Final Answer:

    Add '!' to vocabulary with new ID and tokenize as [1, 2, 3, 5] -> Option A
  4. Quick Check:

    Unknown token added = new ID [OK]
Hint: Add unknown tokens to vocabulary before tokenizing [OK]
Common Mistakes:
  • Ignoring unknown tokens silently
  • Replacing unknown tokens incorrectly
  • Assuming error without handling unknown tokens
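Option A in question 5 can be sketched as below. The helper `encode_and_extend` is hypothetical, and choosing "largest existing ID plus one" as the new ID is an assumption that happens to give 5 here.

```python
import re

vocab = {'I': 1, 'love': 2, 'AI': 3, '.': 4}

def encode_and_extend(text, vocab):
    # Hypothetical helper: unseen tokens receive the next free ID
    ids = []
    for token in re.findall(r"\w+|[^\w\s]", text):
        if token not in vocab:
            vocab[token] = max(vocab.values()) + 1
        ids.append(vocab[token])
    return ids

print(encode_and_extend('I love AI!', vocab))  # [1, 2, 3, 5]
print(vocab['!'])  # 5
```

Note that the function modifies `vocab` in place, which fits the vocabulary-building stage this exercise is about.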