
Vocabulary size control in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
Why limit vocabulary size in NLP models?

In natural language processing, why do we often limit the vocabulary size when building models?

A. To reduce model complexity and memory usage while focusing on the most frequent words
B. To make the model ignore common words like 'the' and 'and'
C. To increase the number of rare words the model can learn
D. To ensure the model only learns from stop words
💡 Hint

Think about how large vocabularies affect model size and training speed.
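As a minimal sketch of the idea behind the hint (using a simple top-k frequency cutoff, with `<UNK>` as an assumed placeholder token for out-of-vocabulary words):

```python
from collections import Counter

texts = ["the cat sat", "the dog ran", "the cat ran fast"]
tokens = " ".join(texts).split()
counter = Counter(tokens)

# Keep only the k most frequent words; everything else maps to <UNK>
k = 3
vocab = {word for word, _ in counter.most_common(k)}
encoded = [t if t in vocab else "<UNK>" for t in tokens]
print(vocab)
print(encoded)
```

Capping the vocabulary this way bounds the embedding matrix (and output softmax) at k rows regardless of how much text you train on.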

Predict Output
intermediate
Vocabulary size after frequency filtering

Given the code below that filters tokens by frequency, what is the length of the resulting vocabulary?

from collections import Counter

texts = ['apple banana apple', 'banana orange apple', 'orange banana banana']
tokens = ' '.join(texts).split()
counter = Counter(tokens)
vocab = {word for word, freq in counter.items() if freq >= 3}
print(len(vocab))
A. 2
B. 3
C. 1
D. 0
💡 Hint

Count how many words appear at least 3 times.

Model Choice
advanced
Choosing a tokenization method to control vocabulary size

You want to control vocabulary size in a text classification task. Which tokenization method helps best to keep vocabulary size manageable while capturing meaningful subword units?

A. Character-level tokenization
B. Word-level tokenization without filtering
C. Sentence-level tokenization
D. Byte Pair Encoding (BPE) subword tokenization
💡 Hint

Think about methods that break words into smaller parts to reduce vocabulary size.
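As a rough sketch of the core idea behind BPE-style subword merging (a toy version, not a full tokenizer): repeatedly merge the most frequent adjacent symbol pair, so frequent words collapse into single tokens while rare words stay split into reusable pieces. The toy corpus below is an assumption for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    # words: dict mapping symbol tuples to corpus frequency
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    # Rewrite every word, replacing each occurrence of `pair` with one symbol
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word starts as a tuple of characters
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
for _ in range(3):  # three merge steps
    words = merge_pair(words, most_frequent_pair(words))
print(words)
```

The number of merges directly controls the final vocabulary size, which is why subword methods make that size a tunable knob rather than a property of the raw text.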

Metrics
advanced
Effect of vocabulary size on model perplexity

When training a language model, how does increasing vocabulary size generally affect the model's perplexity on test data?

A. Perplexity always decreases as vocabulary size increases
B. Perplexity may increase due to data sparsity with very large vocabularies
C. Perplexity always increases as vocabulary size decreases
D. Perplexity remains unchanged regardless of vocabulary size
💡 Hint

Consider the trade-off between vocabulary coverage and data sparsity.
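One way to make the trade-off concrete is to compute perplexity directly. A minimal sketch with a unigram model and add-one smoothing (the tiny train/test strings are assumptions for illustration), where perplexity is exp of the mean negative log-probability:

```python
import math
from collections import Counter

train = "the cat sat on the mat the cat ran".split()
test = "the cat sat".split()

counts = Counter(train)
vocab_size = len(counts)
total = len(train)

def unigram_prob(word):
    # Add-one (Laplace) smoothing so unseen words get nonzero probability
    return (counts[word] + 1) / (total + vocab_size)

nll = -sum(math.log(unigram_prob(w)) for w in test) / len(test)
perplexity = math.exp(nll)
print(perplexity)
```

Note how `vocab_size` appears in the smoothing denominator: with a much larger vocabulary, the same amount of training data spreads probability mass over more words, which is the sparsity effect the question is probing.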

🔧 Debug
expert
Identifying the bug in vocabulary size control code

What error does the following code raise when trying to limit vocabulary size by frequency?

from collections import Counter
texts = ['cat dog cat', 'dog mouse cat', 'mouse dog dog']
tokens = ' '.join(texts).split()
counter = Counter(tokens)
vocab = {word for word in counter if counter[word] > 2}
print(vocab[0])
A. SyntaxError: invalid syntax
B. KeyError: 0
C. TypeError: 'set' object is not subscriptable
D. No error, prints 'cat'
💡 Hint

Look at how vocab is accessed after creation.