In natural language processing, why do we often limit the vocabulary size when building models?
Think about how large vocabularies affect model size and training speed.
Limiting vocabulary size keeps the model smaller and training faster by focusing on frequent words, which carry most of the meaning. Rare words are typically replaced with a special unknown token such as <unk>.
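A minimal sketch of this replacement step, using an assumed frequency threshold of 2 and the token name <unk> (both arbitrary choices for illustration):

```python
from collections import Counter

texts = ['the cat sat', 'the dog ran', 'a cat ran']
tokens = ' '.join(texts).split()
counter = Counter(tokens)

# Keep words seen at least twice; map everything else to a special token.
vocab = {word for word, freq in counter.items() if freq >= 2}
UNK = '<unk>'
replaced = [word if word in vocab else UNK for word in tokens]
print(replaced)  # rare words 'sat', 'dog', 'a' become '<unk>'
```

The threshold controls the size/coverage trade-off directly: raising it shrinks the vocabulary but replaces more of the text with the unknown token.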
Given the code below that filters tokens by frequency, what is the length of the resulting vocabulary?
from collections import Counter
texts = ['apple banana apple', 'banana orange apple', 'orange banana banana']
tokens = ' '.join(texts).split()
counter = Counter(tokens)
vocab = {word for word, freq in counter.items() if freq >= 3}
print(len(vocab))
Count how many words appear at least 3 times.
'apple' appears 3 times, 'banana' 4 times, and 'orange' only 2 times. Only 'apple' and 'banana' meet the frequency threshold, so the resulting vocabulary has length 2.
You want to control vocabulary size in a text classification task. Which tokenization method best keeps the vocabulary manageable while still capturing meaningful subword units?
Think about methods that break words into smaller parts to reduce vocabulary size.
Byte Pair Encoding (BPE) breaks words into subword units by iteratively merging the most frequent adjacent symbol pairs, reducing vocabulary size while preserving meaning better than character-level tokenization.
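A minimal sketch of one BPE merge step, using a toy word-frequency table (the helper names and the example corpus are illustrative, not from any particular library):

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs across all words, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    # Replace every occurrence of the chosen pair with a single merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Words start as character sequences; merges build up subword units.
words = {tuple('lower'): 2, tuple('lowest'): 1, tuple('low'): 3}
pair = most_frequent_pair(words)   # ('l', 'o') occurs in every word
words = merge_pair(words, pair)    # 'lower' is now ('lo','w','e','r')
```

Real BPE repeats this merge step for a fixed number of iterations, and that iteration count is exactly the knob that bounds the final vocabulary size.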
When training a language model, how does increasing vocabulary size generally affect the model's perplexity on test data?
Consider the trade-off between vocabulary coverage and data sparsity.
Very large vocabularies cause many rare words, leading to sparse data and higher perplexity. Moderate vocabularies balance coverage and data density.
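Perplexity itself is just the exponentiated average negative log-probability of the test tokens; a minimal sketch (the helper name and toy probabilities are illustrative):

```python
import math

def perplexity(probs):
    # probs: the probability the model assigned to each test token.
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# A model that assigns probability 0.25 to every token has perplexity 4:
# it is, on average, as uncertain as a uniform choice among 4 options.
print(perplexity([0.25] * 8))  # 4.0
```

With an oversized vocabulary, many test tokens are rare words whose estimated probabilities are tiny, which drags this average down and pushes perplexity up.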
What error does the following code raise when trying to limit vocabulary size by frequency?
from collections import Counter
texts = ['cat dog cat', 'dog mouse cat', 'mouse dog dog']
tokens = ' '.join(texts).split()
counter = Counter(tokens)
vocab = {word for word in counter if counter[word] > 2}
print(vocab[0])
Look at how vocab is accessed after creation.
vocab is a set comprehension result, and sets do not support indexing like a list. Accessing vocab[0] raises TypeError: 'set' object is not subscriptable.
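One way to fix it, assuming indexed access is actually wanted, is to materialize the set as a sorted list so the ordering is well defined:

```python
from collections import Counter

texts = ['cat dog cat', 'dog mouse cat', 'mouse dog dog']
tokens = ' '.join(texts).split()
counter = Counter(tokens)

# sorted() turns the set of kept words into a list, which supports indexing.
vocab = sorted(word for word in counter if counter[word] > 2)
print(vocab[0])  # 'cat'
```

Sorting also makes the result deterministic, which matters because set iteration order is not guaranteed across runs.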