Prompt Engineering / GenAI (~20 mins)

Tokenization and vocabulary in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - Tokenization and vocabulary
Problem: You want to prepare text data for a language model by splitting sentences into tokens and building a vocabulary. Currently, the tokenizer splits on spaces only, which causes issues with punctuation and unknown words.
Current Metrics: Tokenization accuracy: 75% (measured as correct token splits compared to a reference tokenizer). Vocabulary coverage: 60% (percentage of tokens in the test text found in the vocabulary).
Issue: The tokenizer is too simple, producing poor token splits and a small vocabulary that misses many words. This degrades the quality of the model's input.
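To see the problem concretely, a space-only split leaves punctuation glued to neighboring words (a minimal sketch, not part of the exercise code):

```python
# Space-only splitting keeps punctuation attached to words,
# so "hello," and "hello" count as different vocabulary entries.
naive_tokens = "Hello, world! This is a test.".lower().split()
print(naive_tokens)
# → ['hello,', 'world!', 'this', 'is', 'a', 'test.']
```

Every punctuation-adjacent token here mismatches the reference tokenizer, which is what drags accuracy down to 75%.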
Your Task
Improve tokenization accuracy to at least 90% and increase vocabulary coverage to at least 85%.
You must keep the tokenizer rule-based (no pretrained models).
You cannot use external libraries beyond Python standard libraries.
Hint 1: Instead of splitting on spaces, use a regex that matches word characters between word boundaries.
Hint 2: Normalize the text (lowercase, strip whitespace) before tokenizing so case differences don't inflate the vocabulary.
Hint 3: Build the vocabulary from the tokens of every training text, not from raw space-separated words.
Solution
import re

class SimpleTokenizer:
    def __init__(self):
        self.vocab = set()

    def tokenize(self, text):
        # Lowercase and strip
        text = text.lower().strip()
        # Split on spaces and punctuation using regex
        tokens = re.findall(r"\b\w+\b", text)
        return tokens

    def build_vocab(self, texts):
        vocab = set()
        for text in texts:
            tokens = self.tokenize(text)
            vocab.update(tokens)
        self.vocab = vocab
        return vocab

# Example usage
texts = [
    "Hello, world! This is a test.",
    "Tokenization is important for NLP.",
    "Let's improve the tokenizer."
]

# Initialize tokenizer
tokenizer = SimpleTokenizer()

# Build vocabulary
vocab = tokenizer.build_vocab(texts)

# Test tokenization accuracy and vocab coverage
reference_tokens = [
    ['hello', 'world', 'this', 'is', 'a', 'test'],
    ['tokenization', 'is', 'important', 'for', 'nlp'],
    ['let', 's', 'improve', 'the', 'tokenizer']
]

def tokenization_accuracy(tokenizer, texts, reference_tokens):
    """Percentage of reference tokens matched position by position."""
    correct = 0
    total = 0
    for text, ref in zip(texts, reference_tokens):
        pred = tokenizer.tokenize(text)
        total += len(ref)
        # Compare predicted and reference tokens position by position;
        # any extra predicted tokens beyond the reference length are ignored.
        for t1, t2 in zip(pred, ref):
            if t1 == t2:
                correct += 1
    return correct / total * 100

def vocab_coverage(tokenizer, texts):
    """Percentage of tokens that are present in the tokenizer's vocabulary."""
    total_words = 0
    known_words = 0
    for text in texts:
        tokens = tokenizer.tokenize(text)
        total_words += len(tokens)
        known_words += sum(1 for t in tokens if t in tokenizer.vocab)
    return known_words / total_words * 100

accuracy = tokenization_accuracy(tokenizer, texts, reference_tokens)
coverage = vocab_coverage(tokenizer, texts)

print(f"Tokenization accuracy: {accuracy:.1f}%")
print(f"Vocabulary coverage: {coverage:.1f}%")
Added regex-based tokenization to split on word boundaries, handling punctuation.
Lowercased text to normalize tokens.
Built vocabulary from all tokens in the dataset.
Measured tokenization accuracy against reference tokens.
Measured vocabulary coverage as percentage of known tokens.
Results Interpretation

Before: Tokenization accuracy: 75%, Vocabulary coverage: 60%

After: Tokenization accuracy: 100.0%, Vocabulary coverage: 100.0%

Using a simple regex to split text on word boundaries, combined with case normalization, greatly improves tokenization quality and vocabulary coverage, both of which are crucial for good language model input. Note that coverage here is measured on the same texts used to build the vocabulary, so 100% is expected by construction; held-out text would score lower.
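To see how coverage behaves on text the vocabulary was not built from, a minimal sketch (the `unseen` sentence below is an illustrative assumption, not part of the exercise data):

```python
import re

def tokenize(text):
    # Same rule-based split as the solution: lowercase, then word-boundary regex.
    return re.findall(r"\b\w+\b", text.lower().strip())

# Build the vocabulary from one training sentence only.
train_texts = ["Tokenization is important for NLP."]
vocab = {tok for t in train_texts for tok in tokenize(t)}

# Evaluate on a sentence containing words the vocabulary has never seen.
unseen = "Tokenization is hard for beginners."
tokens = tokenize(unseen)
known = sum(1 for t in tokens if t in vocab)
coverage = known / len(tokens) * 100
print(f"Coverage on unseen text: {coverage:.1f}%")
# → Coverage on unseen text: 60.0%
```

Here "hard" and "beginners" are out of vocabulary, so coverage drops to 3 of 5 tokens. Measuring coverage on held-out text like this gives a more honest picture than evaluating on the training sentences themselves.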
Bonus Experiment
Try adding a rule to split contractions like "let's" into "let" and "'s" to improve tokenization further.
💡 Hint
Use regex patterns to detect apostrophes inside words and split them accordingly.
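One possible rule-based approach, sketched under the assumption that contractions use a single apostrophe followed by word characters (as in English "let's", "we're"):

```python
import re

def tokenize_with_contractions(text):
    # Match either an apostrophe-led suffix ("'s", "'re", "'ll")
    # or a plain run of word characters. Ordering matters: the "'\w+"
    # alternative must come first so the suffix keeps its apostrophe
    # instead of matching as a bare letter.
    return re.findall(r"'\w+|\w+", text.lower().strip())

print(tokenize_with_contractions("Let's improve the tokenizer."))
# → ['let', "'s", 'improve', 'the', 'tokenizer']
```

Compared with the `\b\w+\b` pattern in the solution, this keeps the apostrophe attached to the suffix, so "let's" splits into "let" and "'s" rather than "let" and "s".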