NLP · ML · ~20 mins

The Attention Mechanism in NLP, in Depth - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output · intermediate
Output of scaled dot-product attention calculation
What is the output of the following scaled dot-product attention calculation code snippet?
```python
import torch
import torch.nn.functional as F

query = torch.tensor([[1.0, 0.0, 1.0]])  # shape (1, 3)
key = torch.tensor([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])  # shape (2, 3)
value = torch.tensor([[1.0, 2.0], [3.0, 4.0]])  # shape (2, 2)

# Calculate attention scores
scores = torch.matmul(query, key.T) / (key.shape[1] ** 0.5)  # scale by sqrt(d_k)
weights = F.softmax(scores, dim=1)

# Weighted sum of values
output = torch.matmul(weights, value)
print(output)
```
A. [[1.4793, 2.4793]]
B. [[3.0, 4.0]]
C. [[1.5, 2.5]]
D. [[2.0, 3.0]]
💡 Hint
Recall that scaled dot-product attention uses softmax on scaled scores to get weights.
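After attempting the problem, one way to check your answer is to redo the arithmetic in plain Python (no PyTorch needed); the scaled dot products, softmax, and weighted sum are small enough to compute directly:

```python
import math

# Recompute the scaled dot-product attention from the snippet above
# in plain Python, so the printed result can be verified by hand.
query = [1.0, 0.0, 1.0]
keys = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
d_k = len(query)

# Dot product of the query with each key, scaled by sqrt(d_k)
scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]

# Softmax over the scaled scores
exps = [math.exp(s) for s in scores]
total = sum(exps)
weights = [e / total for e in exps]

# Weighted sum of the value vectors
output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(2)]
print([round(x, 4) for x in output])  # → [1.4793, 2.4793]
```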
🧠 Conceptual · intermediate
Purpose of the 'key' in attention mechanism
In the attention mechanism, what is the main role of the 'key' vectors?
A. They normalize the query vectors before similarity calculation.
B. They are used to generate the final output directly without matching.
C. They represent the information to be retrieved by matching with the query.
D. They store the positional encoding of the input sequence.
💡 Hint
Think about how queries find relevant information in keys.
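A toy sketch of the query-key matching idea (the labels "cat" and "car" are hypothetical, not from the problem): the query is compared against each key by dot product, and keys that align with the query score higher, steering retrieval toward their associated values.

```python
# Hypothetical labeled keys: each key advertises what its value contains.
query = [1.0, 0.0]
keys = {"cat": [0.9, 0.1], "car": [0.1, 0.9]}

# Dot-product similarity between the query and each key
scores = {name: sum(q * k for q, k in zip(query, key)) for name, key in keys.items()}
best = max(scores, key=scores.get)
print(scores, "-> best match:", best)  # "cat" aligns most with the query
```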
Hyperparameter · advanced
Effect of increasing number of attention heads
What is the main effect of increasing the number of attention heads in a multi-head attention model?
A. It eliminates the need for positional encoding.
B. It reduces the total number of parameters in the model.
C. It increases the sequence length the model can process without changing computation.
D. It allows the model to jointly attend to information from different representation subspaces at different positions.
💡 Hint
Think about how multiple heads help the model see different aspects of the input.
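One fact worth keeping in mind (assuming the standard multi-head design, where the model dimension is split evenly across heads): adding heads does not add projection parameters, since each head projects to a smaller subspace. What changes is that each head can attend to a different aspect of the input.

```python
# In the standard design (assumed here), each head projects to
# d_model // num_heads dimensions, so total query-projection
# parameters stay constant as the head count grows.
d_model = 512

for num_heads in (1, 4, 8):
    d_head = d_model // num_heads
    # Per-head query projection is d_model x d_head; there are num_heads of them.
    total_q_params = num_heads * d_model * d_head
    print(num_heads, "heads ->", total_q_params, "query-projection params")
```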
Metrics · advanced
Interpreting attention weights distribution
If an attention layer outputs very uniform attention weights across all positions, what does this typically indicate about the model's focus?
A. The model is confidently focusing on a single important position.
B. The model is uncertain and is distributing focus evenly, possibly due to lack of strong signals.
C. The model has overfitted and memorized the training data.
D. The model is ignoring the input sequence entirely.
💡 Hint
Uniform weights mean no position stands out more than others.
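A common way to quantify how spread out an attention distribution is uses Shannon entropy: uniform weights reach the maximum (log of the number of positions), while a sharply peaked distribution has entropy near zero. A minimal sketch with made-up weight vectors:

```python
import math

def entropy(weights):
    # Shannon entropy in nats; higher means focus is spread more evenly.
    return -sum(w * math.log(w) for w in weights if w > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # no position stands out
peaked = [0.97, 0.01, 0.01, 0.01]   # confident focus on one position

print(entropy(uniform))  # log(4) ≈ 1.386, the maximum for 4 positions
print(entropy(peaked))   # much lower: the distribution is concentrated
```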
🔧 Debug · expert
Identifying error in custom attention implementation
Consider this simplified custom attention code snippet. What error will it raise when run?

```python
import torch
import torch.nn.functional as F

def custom_attention(query, key, value):
    scores = torch.matmul(query, key)  # shape mismatch possible
    weights = F.softmax(scores, dim=-1)
    output = torch.matmul(weights, value)
    return output

q = torch.randn(1, 3)
k = torch.randn(2, 3)
v = torch.randn(2, 4)
result = custom_attention(q, k, v)
print(result)
```
A. RuntimeError due to shape mismatch in torch.matmul(query, key)
B. RuntimeError due to shape mismatch in torch.matmul(weights, value)
C. No error; outputs a tensor of shape (1, 4)
D. SyntaxError due to missing colon in function definition
💡 Hint
Check the shapes of query and key tensors before multiplication.
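One way to reason about this without running PyTorch is to trace shapes through the snippet with a small hypothetical helper (matmul requires the inner dimensions to agree):

```python
# Hypothetical shape-tracing helper, for reasoning about the snippet above.
def matmul_shape(a, b):
    if a[-1] != b[0]:
        raise RuntimeError(f"shape mismatch: {a} @ {b}")
    return (a[0], b[1])

q, k, v = (1, 3), (2, 3), (2, 4)

# As written: (1, 3) @ (2, 3) -> inner dims 3 and 2 disagree.
try:
    matmul_shape(q, k)
except RuntimeError as e:
    print("as written:", e)

# With the key transposed (the usual fix): (1, 3) @ (3, 2) -> (1, 2),
# then the softmax keeps the shape and (1, 2) @ (2, 4) -> (1, 4).
scores = matmul_shape(q, (k[1], k[0]))
out = matmul_shape(scores, v)
print("with key.T:", out)  # (1, 4)
```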