
Attention mechanism in depth in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems

❓ Predict Output

intermediate

Output of scaled dot-product attention calculation

What does the following scaled dot-product attention snippet print?

```python
import torch
import torch.nn.functional as F

query = torch.tensor([[1.0, 0.0, 1.0]])  # shape (1, 3)
key = torch.tensor([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])  # shape (2, 3)
value = torch.tensor([[1.0, 2.0], [3.0, 4.0]])  # shape (2, 2)

# Calculate attention scores
scores = torch.matmul(query, key.T) / (key.shape[1] ** 0.5)  # scale by sqrt(d_k)
weights = F.softmax(scores, dim=1)

# Weighted sum of values
output = torch.matmul(weights, value)
print(output)
```

A. [[1.4793 2.4793]]

B. [[3.0 4.0]]

C. [[1.5 2.5]]

D. [[2.0 3.0]]

🧠 Conceptual

intermediate

Purpose of the 'key' in attention mechanism

In the attention mechanism, what is the main role of the 'key' vectors?

A. They normalize the query vectors before similarity calculation.

B. They are used to generate the final output directly without matching.

C. They are matched against the query to determine which information (values) to retrieve.

D. They store the positional encoding of the input sequence.

❓ Hyperparameter

advanced

Effect of increasing number of attention heads

What is the main effect of increasing the number of attention heads in a multi-head attention model?

A. It eliminates the need for positional encoding.

B. It reduces the total number of parameters in the model.

C. It increases the sequence length the model can process without changing computation.

D. It allows the model to jointly attend to information from different representation subspaces at different positions.
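As a rough illustration of how the head count relates to model size, the sketch below counts projection weights under the common convention that each head works in a subspace of size `d_model // num_heads`. The function name `attention_projection_params` and the sizes used are hypothetical, and bias terms are ignored.

```python
def attention_projection_params(d_model: int, num_heads: int) -> int:
    """Count weights in the Q, K, V and output projections (biases ignored).

    Assumes the common convention head_dim = d_model // num_heads, so the
    heads together always span d_model dimensions.
    """
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    head_dim = d_model // num_heads
    # Each head has its own d_model x head_dim slice for Q, K and V.
    qkv = 3 * num_heads * d_model * head_dim
    # The output projection maps the concatenated heads back to d_model.
    out = (num_heads * head_dim) * d_model
    return qkv + out


for heads in (1, 4, 8):
    print(heads, attention_projection_params(512, heads))
```

Under this assumption the count is identical for every head count; what changes is how many separate subspaces the layer attends over.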

❓ Metrics

advanced

Interpreting attention weights distribution

If an attention layer outputs very uniform attention weights across all positions, what does this typically indicate about the model's focus?

A. The model is confidently focusing on a single important position.

B. The model is uncertain and is distributing focus evenly, possibly due to lack of strong signals.

C. The model has overfitted and memorized the training data.

D. The model is ignoring the input sequence entirely.
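One way to put a number on how spread out a row of attention weights is, offered here only as a sketch, is its Shannon entropy. The helper `attention_entropy` and the two example rows are hypothetical and not part of the original exercise.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one row of attention weights."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

print(attention_entropy(uniform))  # equals log(4), the maximum for 4 positions
print(attention_entropy(peaked))   # much smaller than the uniform case
```

Entropy is highest when every position gets the same weight and falls toward zero as the weight concentrates on one position.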

🔧 Debug

expert

Identifying error in custom attention implementation

Consider this simplified custom attention code snippet. What error will it raise when run?

```python
import torch
import torch.nn.functional as F

def custom_attention(query, key, value):
    scores = torch.matmul(query, key)  # shape mismatch possible
    weights = F.softmax(scores, dim=-1)
    output = torch.matmul(weights, value)
    return output

q = torch.randn(1, 3)
k = torch.randn(2, 3)
v = torch.randn(2, 4)
result = custom_attention(q, k, v)
print(result)
```

A. RuntimeError due to shape mismatch in torch.matmul(query, key)

B. RuntimeError due to shape mismatch in torch.matmul(weights, value)

C. No error; outputs a tensor of shape (1, 4)

D. SyntaxError due to missing colon in function definition

Practice

(1/5)

1. What is the main purpose of the attention mechanism in NLP models?

easy

A. To increase the size of the input data

B. To reduce the number of layers in the model

C. To help the model focus on important parts of the input data

D. To randomly shuffle the input tokens

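Solution

Step 1: Understand attention's role

Attention assigns a weight to each input position so the model can emphasize the parts most relevant to the current output.

Step 2: Compare options

Options A, B, and D describe changes to data size, model depth, or token order, none of which attention performs. Option C matches.

Final Answer: C. To help the model focus on important parts of the input data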

Quick Check:

Solution

Step 1: Recall attention weight calculation

Step 2: Evaluate options

Final Answer:

Quick Check:

Solution

Step 1: Calculate dot products Q x K^T

Step 2: Apply softmax to scores

Step 3: Compute weighted sum of values

Step 4: Match option

Final Answer:

Quick Check:
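The three computational steps listed above can be traced by hand on a toy example. This is a minimal sketch with hypothetical inputs (one query, two keys, two values) chosen for illustration; like the listed steps, it omits any scaling of the scores.

```python
import math

# Hypothetical toy inputs: one query, two keys, two values.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]

# Step 1: dot product of the query with each key (one row of Q x K^T).
scores = [sum(q * k for q, k in zip(query, key)) for key in keys]

# Step 2: softmax turns the scores into weights that sum to 1.
max_score = max(scores)  # subtracting the max keeps exp() numerically stable
exps = [math.exp(s - max_score) for s in scores]
weights = [e / sum(exps) for e in exps]

# Step 3: weighted sum of the value vectors.
output = [
    sum(w * v[j] for w, v in zip(weights, values))
    for j in range(len(values[0]))
]

print(scores)
print(weights)
print(output)
```

Because the first key matches the query and the second does not, the first value vector receives the larger weight and dominates the output.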

Solution

Step 1: Check dot product operation

Step 2: Analyze code

Final Answer:

Quick Check:

Solution

Step 1: Understand dot product scaling

Step 2: Role of scaling by sqrt of key dimension

Final Answer:

Quick Check:
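The effect of dividing by the square root of the key dimension can be seen in a small experiment. The sketch below is an illustration under stated assumptions: query and key components are drawn from a standard normal distribution, and the helper names `softmax` and `largest_weight`, the dimension 256, and the seed are all hypothetical choices.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def largest_weight(d_k, scale, n_keys=8, seed=0):
    """Largest attention weight for one random query against n_keys random keys."""
    rng = random.Random(seed)  # fixed seed so both calls see the same vectors
    query = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
    keys = [[rng.gauss(0.0, 1.0) for _ in range(d_k)] for _ in range(n_keys)]
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    if scale:
        # Divide by sqrt(d_k) so the scores do not grow with the dimension.
        scores = [s / math.sqrt(d_k) for s in scores]
    return max(softmax(scores))


print(largest_weight(256, scale=False))
print(largest_weight(256, scale=True))
```

With unscaled scores the softmax tends to put nearly all weight on one key as the dimension grows, which leaves very small gradients for the other positions; the scaled version keeps the weights more spread out.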