Prompt Engineering / GenAI (~20 mins)

Tokenization and vocabulary in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - Tokenization and vocabulary
Problem: You want to prepare text data for a language model by splitting sentences into tokens and building a vocabulary. Currently, the tokenizer splits on spaces only, which causes issues with punctuation and unknown words.
Current Metrics: Tokenization accuracy: 75% (measured as correct token splits compared to a reference tokenizer). Vocabulary coverage: 60% (percentage of tokens in the test text found in the vocabulary).
Issue: The tokenizer is too simple, producing poor token splits and a small vocabulary that misses many words. This degrades the quality of the model's input.
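To see the problem concretely, a space-only split leaves punctuation glued to neighboring words (a minimal sketch, not part of the exercise code):

```python
# Space-only splitting keeps punctuation attached to words,
# so "hello," and "hello" count as different vocabulary entries.
naive_tokens = "Hello, world! This is a test.".lower().split()
print(naive_tokens)
# → ['hello,', 'world!', 'this', 'is', 'a', 'test.']
```

Every punctuation-adjacent token here mismatches the reference tokenizer, which is what drags accuracy down to 75%.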
Your Task
Improve tokenization accuracy to at least 90% and increase vocabulary coverage to at least 85%.
You must keep the tokenizer rule-based (no pretrained models).
You cannot use external libraries beyond Python standard libraries.
Hint 1: Instead of splitting on spaces, use a regex that matches word characters between word boundaries.
Hint 2: Normalize the text (lowercase, strip whitespace) before tokenizing so case differences don't inflate the vocabulary.
Hint 3: Build the vocabulary from the tokens of every training text, not from raw space-separated words.
Solution
import re

class SimpleTokenizer:
    def __init__(self):
        self.vocab = set()

    def tokenize(self, text):
        # Lowercase and strip
        text = text.lower().strip()
        # Split on spaces and punctuation using regex
        tokens = re.findall(r"\b\w+\b", text)
        return tokens

    def build_vocab(self, texts):
        vocab = set()
        for text in texts:
            tokens = self.tokenize(text)
            vocab.update(tokens)
        self.vocab = vocab
        return vocab

# Example usage
texts = [
    "Hello, world! This is a test.",
    "Tokenization is important for NLP.",
    "Let's improve the tokenizer."
]

# Initialize tokenizer
tokenizer = SimpleTokenizer()

# Build vocabulary
vocab = tokenizer.build_vocab(texts)

# Test tokenization accuracy and vocab coverage
reference_tokens = [
    ['hello', 'world', 'this', 'is', 'a', 'test'],
    ['tokenization', 'is', 'important', 'for', 'nlp'],
    ['let', 's', 'improve', 'the', 'tokenizer']
]

def tokenization_accuracy(tokenizer, texts, reference_tokens):
    """Percentage of reference tokens matched position by position."""
    correct = 0
    total = 0
    for text, ref in zip(texts, reference_tokens):
        pred = tokenizer.tokenize(text)
        total += len(ref)
        # Compare predicted and reference tokens position by position;
        # any extra predicted tokens beyond the reference length are ignored.
        for t1, t2 in zip(pred, ref):
            if t1 == t2:
                correct += 1
    return correct / total * 100

def vocab_coverage(tokenizer, texts):
    """Percentage of tokens that are present in the tokenizer's vocabulary."""
    total_words = 0
    known_words = 0
    for text in texts:
        tokens = tokenizer.tokenize(text)
        total_words += len(tokens)
        known_words += sum(1 for t in tokens if t in tokenizer.vocab)
    return known_words / total_words * 100

accuracy = tokenization_accuracy(tokenizer, texts, reference_tokens)
coverage = vocab_coverage(tokenizer, texts)

print(f"Tokenization accuracy: {accuracy:.1f}%")
print(f"Vocabulary coverage: {coverage:.1f}%")
Added regex-based tokenization to split on word boundaries, handling punctuation.
Lowercased text to normalize tokens.
Built vocabulary from all tokens in the dataset.
Measured tokenization accuracy against reference tokens.
Measured vocabulary coverage as percentage of known tokens.
Results Interpretation

Before: Tokenization accuracy: 75%, Vocabulary coverage: 60%

After: Tokenization accuracy: 100.0%, Vocabulary coverage: 100.0%

Using a simple regex to split text on word boundaries, combined with case normalization, greatly improves tokenization quality and vocabulary coverage, both of which are crucial for good language model input. Note that coverage here is measured on the same texts used to build the vocabulary, so 100% is expected by construction; held-out text would score lower.
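To see how coverage behaves on text the vocabulary was not built from, a minimal sketch (the `unseen` sentence below is an illustrative assumption, not part of the exercise data):

```python
import re

def tokenize(text):
    # Same rule-based split as the solution: lowercase, then word-boundary regex.
    return re.findall(r"\b\w+\b", text.lower().strip())

# Build the vocabulary from one training sentence only.
train_texts = ["Tokenization is important for NLP."]
vocab = {tok for t in train_texts for tok in tokenize(t)}

# Evaluate on a sentence containing words the vocabulary has never seen.
unseen = "Tokenization is hard for beginners."
tokens = tokenize(unseen)
known = sum(1 for t in tokens if t in vocab)
coverage = known / len(tokens) * 100
print(f"Coverage on unseen text: {coverage:.1f}%")
# → Coverage on unseen text: 60.0%
```

Here "hard" and "beginners" are out of vocabulary, so coverage drops to 3 of 5 tokens. Measuring coverage on held-out text like this gives a more honest picture than evaluating on the training sentences themselves.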
Bonus Experiment
Try adding a rule to split contractions like "let's" into "let" and "'s" to improve tokenization further.
💡 Hint
Use regex patterns to detect apostrophes inside words and split them accordingly.
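One possible rule-based approach, sketched under the assumption that contractions use a single apostrophe followed by word characters (as in English "let's", "we're"):

```python
import re

def tokenize_with_contractions(text):
    # Match either an apostrophe-led suffix ("'s", "'re", "'ll")
    # or a plain run of word characters. Ordering matters: the "'\w+"
    # alternative must come first so the suffix keeps its apostrophe
    # instead of matching as a bare letter.
    return re.findall(r"'\w+|\w+", text.lower().strip())

print(tokenize_with_contractions("Let's improve the tokenizer."))
# → ['let', "'s", 'improve', 'the', 'tokenizer']
```

Compared with the `\b\w+\b` pattern in the solution, this keeps the apostrophe attached to the suffix, so "let's" splits into "let" and "'s" rather than "let" and "s".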