
Handling out-of-vocabulary words in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
What is the main purpose of using an UNK token in NLP models?
In natural language processing, when a word is not found in the model's vocabulary, an UNK token is often used. What is the main purpose of this token?
A. To remove unknown words from the input text before processing to avoid errors.
B. To replace unknown words with their synonyms from a dictionary.
C. To increase the vocabulary size dynamically during model training.
D. To represent all out-of-vocabulary words with a single placeholder so the model can handle unknown words during inference.
💡 Hint
Think about how the model deals with words it has never seen before.
Predict Output
intermediate
What is the output of this tokenization with OOV handling?
Given the vocabulary {'hello': 1, 'world': 2} and the sentence 'hello there world', what is the tokenized output using 0 as the UNK token index?
vocab = {'hello': 1, 'world': 2}
sentence = 'hello there world'
tokens = [vocab.get(word, 0) for word in sentence.split()]
print(tokens)
A. [1, 1, 2]
B. [1, 0, 2]
C. [0, 1, 2]
D. [1, 2, 0]
💡 Hint
Words not in the vocabulary get the UNK token index 0.
Model Choice
advanced
Which model architecture is best suited to handle out-of-vocabulary words using subword units?
You want to build an NLP model that can handle out-of-vocabulary words effectively by breaking words into smaller parts. Which model architecture or technique is best for this?
A. A model using Byte Pair Encoding (BPE) or WordPiece tokenization to split words into subword units.
B. A model that uses a fixed vocabulary of whole words only.
C. A model that replaces all unknown words with a single UNK token without subword splitting.
D. A model that ignores any word not in the vocabulary during training and inference.
💡 Hint
Think about how breaking words into smaller parts helps with unknown words.
Hyperparameter
advanced
Which hyperparameter affects the size of the vocabulary and thus the handling of OOV words in subword tokenization?
When using Byte Pair Encoding (BPE) for tokenization, which hyperparameter directly controls the vocabulary size and impacts how many out-of-vocabulary words appear?
A. The batch size used during model training.
B. The learning rate of the model optimizer.
C. The number of merge operations performed during BPE training.
D. The dropout rate applied in the model layers.
💡 Hint
Think about what controls how many subword units are created.
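To make the hint concrete, here is a minimal toy sketch of BPE training in Python. It is not the implementation of any particular library; the helper names (merge_pair, learn_bpe, subword_vocab) and the six-word corpus are assumptions made for illustration only. Each merge step joins the most frequent adjacent symbol pair into one new symbol.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair in symbols with the joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(words, num_merges):
    """Learn up to num_merges merge rules, starting from single characters."""
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            merged_corpus[merge_pair(symbols, best)] += freq
        corpus = merged_corpus
    return merges

def subword_vocab(words, merges):
    """Base characters plus one joined symbol per learned merge."""
    symbols = {ch for word in words for ch in word}
    symbols.update(left + right for left, right in merges)
    return symbols

# Hypothetical toy corpus, chosen only for illustration.
words = ["low", "low", "lower", "newest", "newest", "widest"]
for num_merges in (0, 3, 6):
    merges = learn_bpe(words, num_merges)
    print(num_merges, len(subword_vocab(words, merges)))
```

In this sketch the symbol set starts as the distinct characters of the corpus and can only grow as merge rules are learned, which is the relationship the question is probing.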
Metrics
expert
How does the presence of many out-of-vocabulary words affect the model's perplexity on a test set?
You evaluate a language model on a test set containing many out-of-vocabulary (OOV) words. How does this typically affect the model's perplexity metric?
A. Perplexity increases because the model struggles to predict unknown words, indicating worse performance.
B. Perplexity becomes zero because the model replaces all OOV words with UNK tokens.
C. Perplexity remains unchanged because OOV words are ignored during evaluation.
D. Perplexity decreases because unknown words are easier to predict.
💡 Hint
Think about how unknown words affect prediction confidence.
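As a reminder of how the metric is computed, the sketch below evaluates perplexity as the exponential of the average negative log-probability per token. The per-token probabilities are made-up numbers for illustration, not measurements from any model; the second list simply assumes the model assigns low probability to the tokens it has never seen.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability the model gave each token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Hypothetical probabilities a model assigns to each test token.
familiar_text = [0.2, 0.25, 0.2, 0.25]
text_with_oov = [0.2, 0.01, 0.2, 0.01]  # assumed low confidence on unseen words

print(perplexity(familiar_text))
print(perplexity(text_with_oov))
```

Lower probabilities on individual tokens pull the average log-probability down, so under this assumption the second value comes out larger than the first.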

Practice

1. What is the main purpose of using an <UNK> token in natural language processing?
easy
A. To separate words in a sentence
B. To mark the end of a sentence
C. To represent words not seen during training
D. To highlight important keywords

Solution

  1. Step 1: Understand the role of <UNK> token

    The <UNK> token is used to replace words that the model has not seen during training, known as out-of-vocabulary words.
  2. Step 2: Identify the correct purpose

    Since <UNK> stands for unknown words, it helps the model handle new or rare words by treating them as a single token.
  3. Final Answer:

    To represent words not seen during training -> Option C
  4. Quick Check:

    <UNK> = unknown words [OK]
Hint: Think of <UNK> as a placeholder for unknown words
Common Mistakes:
  • Confusing <UNK> with sentence delimiters
  • Using <UNK> for common words
  • Thinking <UNK> highlights keywords
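The role of <UNK> described in the solution can be sketched in a few lines of Python. This is a minimal illustration, not a standard API: the helper names build_vocab and encode, the min_count threshold, and the training sentences are all assumptions. Index 0 is reserved for <UNK>, and any word that did not make it into the vocabulary maps to that index.

```python
from collections import Counter

UNK = "<UNK>"

def build_vocab(corpus, min_count=2):
    """Give an id to each word seen at least min_count times; id 0 is reserved for <UNK>."""
    counts = Counter(word for sentence in corpus for word in sentence.split())
    vocab = {UNK: 0}
    for word, count in counts.items():
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Map each word to its id, falling back to the <UNK> id for unseen or rare words."""
    return [vocab.get(word, vocab[UNK]) for word in sentence.split()]

# Hypothetical training sentences, chosen only for illustration.
train = ["the cat sat", "the dog sat", "the bird flew"]
vocab = build_vocab(train)
ids = encode("the unicorn sat", vocab)
print(vocab)  # {'<UNK>': 0, 'the': 1, 'sat': 2}
print(ids)    # [1, 0, 2]
```

Here 'unicorn' was never seen, and rare words such as 'cat' fall below the assumed threshold, so all of them share the single <UNK> id at encoding time.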
2. Which of the following is the correct way to replace out-of-vocabulary words with <UNK> in a Python list of tokens named tokens given a vocabulary set vocab?
easy
A. tokens = [word if word in vocab else '<UNK>' for word in tokens]
B. tokens = [word for word in tokens if word in vocab else '<UNK>']
C. tokens = [word in vocab ? word : '<UNK>' for word in tokens]
D. tokens = [word if word not in vocab else '<UNK>' for word in tokens]

Solution

  1. Step 1: Understand list comprehension syntax

    The correct Python syntax for conditional expressions inside a list comprehension is: [x if condition else y for x in list].
  2. Step 2: Apply correct condition for replacing OOV words

    We want to keep the word if it is in the vocabulary; otherwise, replace it with '<UNK>'. tokens = [word if word in vocab else '<UNK>' for word in tokens] correctly implements this logic.
  3. Final Answer:

    tokens = [word if word in vocab else '<UNK>' for word in tokens] -> Option A
  4. Quick Check:

    Correct Python conditional list comprehension [OK]
Hint: Remember: x if condition else y inside list comprehensions
Common Mistakes:
  • Placing else after the trailing filter if, which is a syntax error (as in option B)
  • Confusing Python with other languages' ternary syntax
  • Reversing the condition logic
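One way to check the four options yourself is to compile each one and see what happens. The sketch below does this with Python's built-in compile and eval; the sample vocab and tokens are assumptions borrowed from the style of the other problems.

```python
candidates = {
    "A": "[word if word in vocab else '<UNK>' for word in tokens]",
    "B": "[word for word in tokens if word in vocab else '<UNK>']",
    "C": "[word in vocab ? word : '<UNK>' for word in tokens]",
    "D": "[word if word not in vocab else '<UNK>' for word in tokens]",
}

# Sample data, assumed for illustration.
vocab = {"hello", "world"}
tokens = ["hello", "there", "world"]

results = {}
for label, source in candidates.items():
    try:
        code = compile(source, "<option>", "eval")
    except SyntaxError:
        results[label] = "SyntaxError"
        continue
    results[label] = eval(code, {"vocab": vocab, "tokens": tokens})

for label, outcome in results.items():
    print(label, outcome)
# A ['hello', '<UNK>', 'world']
# B SyntaxError
# C SyntaxError
# D ['<UNK>', 'there', '<UNK>']
```

Options B and C fail to compile, while D compiles but applies the reversed condition, so only A both runs and does what the question asks.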
3. Given the following code snippet, what will be the output?
vocab = {'hello', 'world'}
tokens = ['hello', 'there', 'world']
tokens = [word if word in vocab else '<UNK>' for word in tokens]
print(tokens)
medium
A. ['hello', 'there', 'world']
B. ['hello', 'world', '<UNK>']
C. ['<UNK>', '<UNK>', '<UNK>']
D. ['hello', '<UNK>', 'world']

Solution

  1. Step 1: Check each token against the vocabulary

    'hello' is in vocab, so it stays 'hello'. 'there' is not in vocab, so it becomes '<UNK>'. 'world' is in vocab, so it stays 'world'.
  2. Step 2: Construct the resulting list

    The new tokens list is ['hello', '<UNK>', 'world'].
  3. Final Answer:

    ['hello', '<UNK>', 'world'] -> Option D
  4. Quick Check:

    Replace OOV words with <UNK> [OK]
Hint: Replace words not in vocab with <UNK>
Common Mistakes:
  • Not replacing 'there' because of misunderstanding
  • Replacing all words regardless of vocab
  • Confusing list order in output
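The same membership test can also be used to measure how much of an input is out-of-vocabulary, which is a common diagnostic before deciding how to handle OOV words. The helper name oov_rate below is a hypothetical one introduced for this sketch.

```python
def oov_rate(tokens, vocab):
    """Fraction of tokens that are not in the vocabulary (0.0 for empty input)."""
    if not tokens:
        return 0.0
    unknown = sum(1 for word in tokens if word not in vocab)
    return unknown / len(tokens)

vocab = {"hello", "world"}
tokens = ["hello", "there", "world"]
rate = oov_rate(tokens, vocab)
print(f"{rate:.2f}")  # 0.33
```

One of the three tokens ('there') is missing from the vocabulary, so the rate is one third.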
4. The following code is intended to replace out-of-vocabulary words with <UNK>. What is the error?
vocab = {'cat', 'dog'}
tokens = ['cat', 'bird', 'dog']
tokens = [word if word not in vocab else '<UNK>' for word in tokens]
print(tokens)
medium
A. The vocabulary should be a list, not a set
B. The condition is reversed; it replaces in-vocab words instead of OOV
C. The list comprehension syntax is invalid
D. The print statement is missing parentheses

Solution

  1. Step 1: Analyze the condition in list comprehension

    The condition word if word not in vocab else '<UNK>' means words NOT in vocab stay as they are, and words IN vocab become '<UNK>'. This is the opposite of the intended behavior.
  2. Step 2: Identify the correct logic

    We want to keep words in vocab and replace words not in vocab with '<UNK>'. So the condition should be word if word in vocab else '<UNK>'.
  3. Final Answer:

    The condition is reversed; it replaces in-vocab words instead of OOV -> Option B
  4. Quick Check:

    Correct condition keeps vocab words, replaces others [OK]
Hint: Check if condition matches intended keep-or-replace logic
Common Mistakes:
  • Mixing up 'in' and 'not in' in conditions
  • Assuming that using a set instead of a list changes the result of a membership test
  • Overlooking that print(tokens) is already valid Python 3 syntax
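A small sanity check can catch a reversed condition like this one early. The sketch below wraps the corrected logic in a hypothetical helper, replace_oov, and asserts two properties that the buggy version would violate.

```python
def replace_oov(tokens, vocab, unk="<UNK>"):
    """Keep in-vocabulary words and replace everything else with the unk token."""
    return [word if word in vocab else unk for word in tokens]

vocab = {"cat", "dog"}
tokens = ["cat", "bird", "dog"]
fixed = replace_oov(tokens, vocab)

# Every output token is either a vocabulary word or the placeholder.
assert all(word in vocab or word == "<UNK>" for word in fixed)
# In-vocabulary words pass through unchanged.
assert all(out == src for out, src in zip(fixed, tokens) if src in vocab)

print(fixed)  # ['cat', '<UNK>', 'dog']
```

With the reversed condition, 'bird' would survive in the output and the first assertion would fail.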
5. You have a pretrained word embedding model that does not include the word 'unicorn'. Which approach best helps your model handle this out-of-vocabulary word during inference?
hard
A. Use subword tokenization to break 'unicorn' into known parts
B. Ignore 'unicorn' and remove it from the input
C. Add 'unicorn' to the vocabulary without retraining
D. Replace 'unicorn' with <UNK> token embedding

Solution

  1. Step 1: Understand limitations of <UNK> token

    Replacing with <UNK> loses specific meaning, which may reduce model accuracy.
  2. Step 2: Consider subword tokenization benefits

    Subword tokenization breaks unknown words into smaller known units, allowing the model to infer meaning from parts.
  3. Step 3: Evaluate other options

    Ignoring the word loses information; adding it to the vocabulary without retraining leaves it with no learned embedding; subword tokenization is the best practical approach.
  4. Final Answer:

    Use subword tokenization to break 'unicorn' into known parts -> Option A
  5. Quick Check:

    Subword methods handle OOV words better than <UNK> [OK]
Hint: Break unknown words into smaller known pieces with subword tokenization
Common Mistakes:
  • Thinking <UNK> always preserves meaning
  • Trying to add words without retraining embeddings
  • Removing unknown words, which discards important information
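The idea of breaking 'unicorn' into known parts can be sketched with a greedy longest-match-first splitter in the style of WordPiece. This is a simplified illustration rather than the algorithm of any specific tokenizer library; the function name subword_tokenize and the tiny subword vocabulary are assumptions, and the "##" prefix marks a piece that continues a word.

```python
def subword_tokenize(word, subword_vocab, unk="<UNK>"):
    """Greedily split word into the longest known pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in subword_vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known piece fits, fall back to <UNK>
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical subword vocabulary, chosen only for illustration.
subword_vocab = {"un", "uni", "##corn", "##c", "##orn"}

print(subword_tokenize("unicorn", subword_vocab))  # ['uni', '##corn']
print(subword_tokenize("zebra", subword_vocab))    # ['<UNK>']
```

The word-level vocabulary never saw 'unicorn', yet both of its pieces have entries the model can look up, while a word with no matching pieces still falls back to <UNK>.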