Challenge - 5 Problems

🧠 Conceptual

intermediate

What is the main purpose of using an UNK token in NLP models?

In natural language processing, when a word is not found in the model's vocabulary, an UNK token is often used. What is the main purpose of this token?

A. To remove unknown words from the input text before processing to avoid errors.

B. To replace unknown words with their synonyms from a dictionary.

C. To increase the vocabulary size dynamically during model training.

D. To represent all out-of-vocabulary words with a single placeholder so the model can handle unknown words during inference.

❓ Predict Output

intermediate

What is the output of this tokenization with OOV handling?

Given the vocabulary {'hello': 1, 'world': 2} and the sentence 'hello there world', what is the tokenized output using 0 as the UNK token index?

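# Word-to-ID vocabulary; index 0 is reserved for the UNK token
vocab = {'hello': 1, 'world': 2}
sentence = 'hello there world'
# Look up each whitespace-separated word, falling back to 0 when it is missing
tokens = [vocab.get(word, 0) for word in sentence.split()]
print(tokens)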

A. [1, 1, 2]

B. [1, 0, 2]

C. [0, 1, 2]

D. [1, 2, 0]

❓ Model Choice

advanced

Which model architecture is best suited to handle out-of-vocabulary words using subword units?

You want to build an NLP model that can handle out-of-vocabulary words effectively by breaking words into smaller parts. Which model architecture or technique is best for this?

A. A model using Byte Pair Encoding (BPE) or WordPiece tokenization to split words into subword units.

B. A model that uses a fixed vocabulary of whole words only.

C. A model that replaces all unknown words with a single UNK token without subword splitting.

D. A model that ignores any word not in the vocabulary during training and inference.
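
For readers who want to see what "breaking words into smaller parts" can look like in practice, here is a minimal sketch of greedy longest-match-first splitting in the style of WordPiece. The function name `wordpiece_tokenize`, the `##` continuation prefix convention, the `[UNK]` string, and the toy vocabulary are all assumptions made for illustration; real tokenizers learn their vocabulary from data and differ in detail.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces.

    Pieces that do not start the word carry a '##' prefix (an assumed
    convention for this sketch). If no split exists, return [unk].
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece in the vocabulary covers this position
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary of subword units
toy_vocab = {"un", "break", "##break", "##able", "##s"}

print(wordpiece_tokenize("unbreakable", toy_vocab))
print(wordpiece_tokenize("breaks", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

In this sketch a word is mapped to the unknown placeholder only when none of its positions can be covered by a known piece.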

❓ Hyperparameter

advanced

Which hyperparameter affects the size of the vocabulary and thus the handling of OOV words in subword tokenization?

When using Byte Pair Encoding (BPE) for tokenization, which hyperparameter directly controls the vocabulary size and impacts how many out-of-vocabulary words appear?

A. The batch size used during model training.

B. The learning rate of the model optimizer.

C. The number of merge operations performed during BPE training.

D. The dropout rate applied in the model layers.
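
As background for this question, the following is a toy sketch of how BPE training builds its symbol inventory, assuming a tiny word list and character-level starting symbols. The function name `learn_bpe` and the sample words are hypothetical; production implementations add details such as end-of-word markers and frequency thresholds that are omitted here.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges BPE merges from a list of words (toy version)."""
    # Represent each word as a tuple of symbols, starting from single characters
    corpus = Counter(tuple(w) for w in words)
    base_symbols = {ch for w in words for ch in w}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    # Symbol inventory: the base characters plus every merged symbol
    vocab = base_symbols | {a + b for a, b in merges}
    return merges, vocab


# Hypothetical toy corpus
words = ["low", "low", "lower", "newest", "newest", "widest"]
for n in (0, 2, 5):
    merges, vocab = learn_bpe(words, n)
    print(n, len(merges), len(vocab))
```

Running the loop with different merge budgets lets you compare the resulting inventory sizes on the same corpus.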

❓ Metrics

expert

How does the presence of many out-of-vocabulary words affect the model's perplexity on a test set?

You evaluate a language model on a test set containing many out-of-vocabulary (OOV) words. How does this typically affect the model's perplexity metric?

A. Perplexity increases because the model struggles to predict unknown words, indicating worse performance.

B. Perplexity becomes zero because the model replaces all OOV words with UNK tokens.

C. Perplexity remains unchanged because OOV words are ignored during evaluation.

D. Perplexity decreases because unknown words are easier to predict.
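
To reason about this question it helps to recall how perplexity is computed: it is the exponential of the average negative log-probability the model assigns to each test token. The sketch below implements that definition; the helper name `perplexity` and both probability lists are made-up illustrative values, not measurements from any model.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability per token)."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Hypothetical per-token probabilities assigned by a model to two short test
# sequences. The second assumes the model gives very low probability to the
# two tokens it has never seen.
familiar_tokens = [0.2, 0.25, 0.1, 0.2]
with_unseen_tokens = [0.2, 0.001, 0.1, 0.001]

print(perplexity(familiar_tokens))
print(perplexity(with_unseen_tokens))
```

Note that how OOV words are scored depends on the evaluation setup, so reported perplexities are only comparable across models that share a vocabulary and OOV policy.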

Practice

(1/5)

1. What is the main purpose of using an <UNK> token in natural language processing?

easy

A. To separate words in a sentence

B. To mark the end of a sentence

C. To represent words not seen during training

D. To highlight important keywords

Handling out-of-vocabulary words in NLP - Practice Problems & Coding Challenges

Solution

Step 1: Understand the role of `<UNK>` token

Step 2: Identify the correct purpose

Final Answer:

Quick Check:

Solution

Step 1: Understand list comprehension syntax

Step 2: Apply correct condition for replacing OOV words

Final Answer:

Quick Check:

Solution

Step 1: Check each token against the vocabulary

Step 2: Construct the resulting list

Final Answer:

Quick Check:

Solution

Step 1: Analyze the condition in list comprehension

Step 2: Identify the correct logic

Final Answer:

Quick Check:

Solution

Step 1: Understand limitations of `<UNK>` token

Step 2: Consider subword tokenization benefits

Step 3: Evaluate other options

Final Answer:

Quick Check:
