Recall & Review
beginner
What are out-of-vocabulary (OOV) words in NLP?
OOV words are words that a model has never seen during training. They are new or rare words that do not appear in the model's vocabulary.
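As a rough illustration of the definition above, the sketch below builds a vocabulary from a tiny training text and lists the test tokens that fall outside it. The sample sentences and the min_count threshold are assumptions made up for this example, not part of any particular library.

```python
from collections import Counter

# Toy training text (made up for illustration).
train_tokens = "the cat sat on the mat the dog sat".split()

# Assumed rule: keep only words seen at least min_count times.
min_count = 2
counts = Counter(train_tokens)
vocab = {word for word, count in counts.items() if count >= min_count}

# Test text containing rare words ("cat", "on") and a new word ("sofa").
test_tokens = "the cat sat on the sofa".split()
oov_tokens = [word for word in test_tokens if word not in vocab]
oov_rate = len(oov_tokens) / len(test_tokens)

print(sorted(vocab))   # ['sat', 'the']
print(oov_tokens)      # ['cat', 'on', 'sofa']
print(oov_rate)        # 0.5
```

Both rare training words and entirely new words end up out of vocabulary here, which matches the "new or rare" wording in the answer.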
beginner
Why is handling OOV words important in NLP?
Handling OOV words is important because models need to understand or process new words to work well on real-world text, which often contains unseen words.
beginner
Name one simple method to handle OOV words.
One simple method is to replace OOV words with a special token like <UNK> (unknown), so the model treats all unknown words the same way.
intermediate
How do subword tokenization methods help with OOV words?
Subword tokenization breaks words into smaller units (such as prefixes, suffixes, or frequent character sequences), so even if a full word is new, its parts are likely to be in the vocabulary, which lets the model build a representation for it.
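The idea can be sketched with a greedy longest-match split, similar in spirit to WordPiece-style tokenizers. This is a minimal sketch, not any library's actual algorithm: the function name subword_tokenize and the toy subword_vocab are hypothetical, and a real subword vocabulary would be learned from a corpus.

```python
def subword_tokenize(word, subword_vocab, unk="<UNK>"):
    """Greedily split `word` into the longest pieces found in `subword_vocab`."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if candidate in subword_vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No known piece starts here, so fall back to <UNK> for the word.
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, chosen only for this example.
subword_vocab = {"un", "break", "able", "token", "ize", "r", "s"}

print(subword_tokenize("unbreakable", subword_vocab))  # ['un', 'break', 'able']
print(subword_tokenize("tokenizers", subword_vocab))   # ['token', 'ize', 'r', 's']
print(subword_tokenize("xyz", subword_vocab))          # ['<UNK>']
```

Neither "unbreakable" nor "tokenizers" is in the vocabulary as a whole word, yet both are covered by known pieces. Only a word with no matching pieces at all still falls back to <UNK>.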
intermediate
What is the role of character-level models in handling OOV words?
Character-level models read words letter by letter, so they can build meaning from any word, even if it was never seen before, reducing OOV problems.
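A minimal sketch of the input side of such a model: each word becomes a sequence of character ids, so no word-level vocabulary lookup is needed. The character inventory (lowercase ASCII letters) and the names char_to_id, CHAR_UNK, and encode_chars are assumptions for this example. Note that a character outside the inventory still needs a fallback id.

```python
import string

# Assumed character inventory: lowercase ASCII letters plus one fallback symbol.
CHAR_UNK = "<UNK_CHAR>"
char_to_id = {ch: i for i, ch in enumerate(string.ascii_lowercase)}
char_to_id[CHAR_UNK] = len(char_to_id)  # id 26

def encode_chars(word):
    """Map a word to a list of character ids, lowercasing it first."""
    return [char_to_id.get(ch, char_to_id[CHAR_UNK]) for ch in word.lower()]

print(encode_chars("cab"))      # [2, 0, 1]
print(encode_chars("unicorn"))  # every letter has an id, even if the word is new
print(encode_chars("é"))        # [26], a character outside the inventory
```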
What does the <UNK> token represent in NLP?
A. A common stop word
B. Unknown or out-of-vocabulary words
C. A punctuation mark
D. A named entity
The <UNK> token is used to represent words that are not in the model's vocabulary, i.e., unknown or out-of-vocabulary words.
Which method breaks words into smaller known pieces to handle OOV words?
A. Lemmatization
B. Stop word removal
C. Subword tokenization
D. Part-of-speech tagging
Subword tokenization splits words into smaller parts, helping models understand new words by their known pieces.
Character-level models read words one letter at a time, allowing them to handle any word, even unseen ones.
What is a downside of replacing OOV words with <UNK> token?
A. Model treats all unknown words the same, losing specific meaning
B. It increases vocabulary size
C. It slows down training
D. It requires labeled data
Using the <UNK> token means the model cannot distinguish between different unknown words, losing their unique meanings.
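This loss of information is easy to see once tokens are mapped to integer ids. In the sketch below, the vocabulary, the sentences, and the encode helper are made up for illustration. Two sentences that differ only in their unknown word produce identical id sequences.

```python
# Toy vocabulary with <UNK> at id 0 (an assumption for this example).
vocab = ["<UNK>", "the", "saw", "a", "we"]
token_to_id = {token: i for i, token in enumerate(vocab)}

def encode(tokens):
    """Map tokens to ids, sending anything outside the vocabulary to <UNK>."""
    unk_id = token_to_id["<UNK>"]
    return [token_to_id.get(token, unk_id) for token in tokens]

ids_unicorn = encode(["we", "saw", "a", "unicorn"])
ids_narwhal = encode(["we", "saw", "a", "narwhal"])

print(ids_unicorn)                 # [4, 2, 3, 0]
print(ids_narwhal)                 # [4, 2, 3, 0]
print(ids_unicorn == ids_narwhal)  # True: the model sees identical input
```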
Which of these is NOT a common way to handle OOV words?
A. Using character-level models
B. Replacing with <UNK> token
C. Using subword tokenization
D. Ignoring OOV words completely
Ignoring OOV words completely is not effective because it loses information; other methods help the model understand or represent them.
Explain what out-of-vocabulary words are and why they pose a challenge in NLP.
Think about words a model never saw during training.
Describe at least two methods to handle out-of-vocabulary words and how they help.
Consider simple replacement and breaking words into parts.
Practice
1. What is the main purpose of using an <UNK> token in natural language processing?
easy
A. To separate words in a sentence
B. To mark the end of a sentence
C. To represent words not seen during training
D. To highlight important keywords
Solution
Step 1: Understand the role of <UNK> token
The <UNK> token is used to replace words that the model has not seen during training, known as out-of-vocabulary words.
Step 2: Identify the correct purpose
Since <UNK> stands for unknown words, it helps the model handle new or rare words by treating them as a single token.
Final Answer:
To represent words not seen during training -> Option C
Quick Check:
<UNK> = unknown words [OK]
Hint: Think of <UNK> as a placeholder for unknown words [OK]
Common Mistakes:
Confusing <UNK> with sentence delimiters
Using <UNK> for common words
Thinking <UNK> highlights keywords
2. Which of the following is the correct way to replace out-of-vocabulary words with <UNK> in a Python list of tokens named tokens given a vocabulary set vocab?
easy
A. tokens = [word if word in vocab else '<UNK>' for word in tokens]
B. tokens = [word for word in tokens if word in vocab else '<UNK>']
C. tokens = [word in vocab ? word : '<UNK>' for word in tokens]
D. tokens = [word if word not in vocab else '<UNK>' for word in tokens]
Solution
Step 1: Understand list comprehension syntax
The correct Python syntax for conditional expressions inside a list comprehension is: [x if condition else y for x in list].
Step 2: Apply correct condition for replacing OOV words
We want to keep the word if it is in the vocabulary; otherwise, replace it with '<UNK>'. tokens = [word if word in vocab else '<UNK>' for word in tokens] correctly implements this logic.
Final Answer:
tokens = [word if word in vocab else '<UNK>' for word in tokens] -> Option A
Quick Check:
Correct Python conditional list comprehension [OK]
Hint: Remember: x if condition else y inside list comprehensions [OK]
Common Mistakes:
Placing the else after the for clause (a trailing if only filters items and cannot take an else)
Confusing Python with other languages' ternary syntax
Reversing the condition logic
3. Given the following code snippet, what will be the output?
vocab = {'hello', 'world'}
tokens = ['hello', 'there', 'world']
tokens = [word if word in vocab else '<UNK>' for word in tokens]
print(tokens)
medium
A. ['hello', 'there', 'world']
B. ['hello', 'world', '<UNK>']
C. ['<UNK>', '<UNK>', '<UNK>']
D. ['hello', '<UNK>', 'world']
Solution
Step 1: Check each token against the vocabulary
'hello' is in vocab, so it stays 'hello'. 'there' is not in vocab, so it becomes '<UNK>'. 'world' is in vocab, so it stays 'world'.
Step 2: Construct the resulting list
The new tokens list is ['hello', '<UNK>', 'world'].
Final Answer:
['hello', '<UNK>', 'world'] -> Option D
Quick Check:
Replace OOV words with <UNK> [OK]
Hint: Replace words not in vocab with <UNK> [OK]
Common Mistakes:
Not replacing 'there' because of misunderstanding
Replacing all words regardless of vocab
Confusing list order in output
4. The following code is intended to replace out-of-vocabulary words with <UNK>. What is the error?
vocab = {'cat', 'dog'}
tokens = ['cat', 'bird', 'dog']
tokens = [word if word not in vocab else '<UNK>' for word in tokens]
print(tokens)
medium
A. The vocabulary should be a list, not a set
B. The condition is reversed; it replaces in-vocab words instead of OOV
C. The list comprehension syntax is invalid
D. The print statement is missing parentheses
Solution
Step 1: Analyze the condition in list comprehension
The condition word if word not in vocab else '<UNK>' means words NOT in vocab stay as they are, and words IN vocab become '<UNK>'. This is the opposite of the intended behavior.
Step 2: Identify the correct logic
We want to keep words in vocab and replace words not in vocab with '<UNK>'. So the condition should be word if word in vocab else '<UNK>'.
Final Answer:
The condition is reversed; it replaces in-vocab words instead of OOV -> Option B
Hint: Check if condition matches intended keep-or-replace logic [OK]
Common Mistakes:
Mixing up 'in' and 'not in' in conditions
Assuming set vs list affects membership test
Ignoring Python 3 print syntax
5. You have a pretrained word embedding model that does not include the word 'unicorn'. Which approach best helps your model handle this out-of-vocabulary word during inference?
hard
A. Use subword tokenization to break 'unicorn' into known parts
B. Ignore 'unicorn' and remove it from the input
C. Add 'unicorn' to the vocabulary without retraining
D. Replace 'unicorn' with <UNK> token embedding
Solution
Step 1: Understand limitations of <UNK> token
Replacing with <UNK> loses specific meaning, which may reduce model accuracy.
Step 2: Consider subword tokenization benefits
Subword tokenization breaks unknown words into smaller known units, allowing the model to infer meaning from parts.
Step 3: Evaluate other options
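Ignoring the word discards information. Adding it to the vocabulary without retraining gives it no trained embedding, so the new entry carries no meaning. Subword tokenization is the most practical option, provided the embedding model was trained on subword units.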
Final Answer:
Use subword tokenization to break 'unicorn' into known parts -> Option A
Quick Check:
Subword methods handle OOV words better than <UNK> [OK]
Hint: Break unknown words into smaller known pieces with subword tokenization [OK]
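For word embeddings, one known way to obtain such pieces is character n-grams with word-boundary markers, the approach used by fastText. The sketch below only extracts the n-grams and shows how an unseen word shares most of them with a related word. The helper name char_ngrams and the choice n=3 are assumptions for illustration, and no embedding vectors are computed here.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of `word`, padded with '<' and '>' markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("unicorn"))
# ['<un', 'uni', 'nic', 'ico', 'cor', 'orn', 'rn>']

# An unseen word overlaps heavily with a related word the model may know.
shared = set(char_ngrams("unicorn")) & set(char_ngrams("unicorns"))
print(len(shared))  # 6
```

A model trained on n-gram units can combine the vectors of these pieces into a representation for 'unicorn', whereas a plain word-level lookup would have nothing to return.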