NLP · ML · ~5 mins

Vocabulary size control in NLP - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is vocabulary size control in NLP?
Vocabulary size control is the process of limiting or managing the number of unique words or tokens used in a language model to improve efficiency and reduce complexity.
beginner
Why do we need to control vocabulary size in NLP models?
Controlling vocabulary size helps reduce memory use, speeds up training, and avoids rare words that add noise, making models more efficient and generalizable.
intermediate
Name two common methods to control vocabulary size.
1. Limiting vocabulary to the most frequent words. 2. Using subword units like Byte Pair Encoding (BPE) to break rare words into smaller parts.
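As a minimal sketch of the first method, frequency-based truncation keeps only the top-k tokens and maps everything else to an unknown token. The helper names (`build_vocab`, `encode`, the `<unk>` symbol) are illustrative, not from any particular library:

```python
from collections import Counter

def build_vocab(corpus, max_size):
    """Keep only the max_size most frequent tokens; map the rest to <unk>."""
    counts = Counter(tok for sentence in corpus for tok in sentence.split())
    vocab = {"<unk>": 0}
    for tok, _ in counts.most_common(max_size):
        vocab[tok] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Replace out-of-vocabulary tokens with the <unk> id."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in sentence.split()]

corpus = ["the cat sat", "the dog sat", "the cat ran"]
vocab = build_vocab(corpus, max_size=3)   # only "the", "cat", "sat" survive
print(encode("the dog ran", vocab))       # → [1, 0, 0]
```

Note how "dog" and "ran" both collapse to id 0: truncation shrinks the vocabulary, but rare words lose their identity entirely, which is the weakness subword methods address.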
intermediate
How does Byte Pair Encoding (BPE) help with vocabulary size control?
BPE merges frequent pairs of characters or subwords to create a compact vocabulary that can represent rare words as combinations of smaller units, reducing total vocabulary size.
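The merge loop at the heart of BPE can be sketched in a few lines: start from characters, count adjacent symbol pairs weighted by word frequency, and repeatedly fuse the most frequent pair. This toy `bpe_merges` function is a simplified illustration, not a production tokenizer:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent symbol pair."""
    # Each word starts as a tuple of single-character symbols.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

words = ["low", "low", "lower", "lowest"]
print(bpe_merges(words, 2))  # → [('l', 'o'), ('lo', 'w')]
```

After two merges the shared stem "low" becomes a single unit, while the rarer suffixes "er" and "est" remain as characters, showing how BPE spends its vocabulary budget on frequent patterns.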
advanced
What is the trade-off when choosing a smaller vocabulary size?
A smaller vocabulary reduces model size and speeds training but may increase the number of tokens per sentence, potentially making sequences longer and harder to process.
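To make the trade-off concrete, compare the two extremes on one sentence: word-level tokenization (large vocabulary, short sequence) versus character-level tokenization (tiny vocabulary, long sequence). The underscore space marker below is just an illustrative convention:

```python
sentence = "tokenization controls vocabulary growth"

# Word-level: one token per word; vocabulary can grow unboundedly.
word_tokens = sentence.split()

# Character-level: vocabulary is tiny, but sequences get much longer.
char_tokens = list(sentence.replace(" ", "_"))

print(len(word_tokens), len(char_tokens))  # → 4 39
```

The same sentence costs 4 tokens at one extreme and 39 at the other; subword vocabularies like BPE sit between these two points.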
What happens if vocabulary size is too large in an NLP model?
A. The model uses more memory and trains slower
B. The model becomes faster and smaller
C. The model ignores rare words
D. The model always improves accuracy
Which method breaks words into smaller parts to reduce vocabulary size?
A. One-hot encoding
B. Stop word removal
C. Lemmatization
D. Byte Pair Encoding (BPE)
Limiting vocabulary to the most frequent words helps because:
A. It reduces noise and model size
B. Rare words are always unimportant
C. It increases the number of tokens per sentence
D. It makes the model ignore common words
What is a downside of using a very small vocabulary?
A. More memory usage
B. Longer token sequences
C. Slower training
D. Ignoring frequent words
Vocabulary size control is important because:
A. It always improves model accuracy
B. It removes all rare words
C. It balances model size and performance
D. It makes models ignore punctuation
Explain vocabulary size control and why it matters in NLP models.
Think about how many words a model knows and how that affects speed and memory.
Describe two methods to control vocabulary size and their pros and cons.
Consider how each method handles rare words and vocabulary size.