Challenge - 5 Problems
Attention Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of scaled dot-product attention calculation
What is the output of the following scaled dot-product attention calculation code snippet?
NLP
```python
import torch
import torch.nn.functional as F

query = torch.tensor([[1.0, 0.0, 1.0]])                 # shape (1, 3)
key = torch.tensor([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])  # shape (2, 3)
value = torch.tensor([[1.0, 2.0], [3.0, 4.0]])          # shape (2, 2)

# Calculate attention scores, scaled by sqrt(d_k)
scores = torch.matmul(query, key.T) / (key.shape[1] ** 0.5)
weights = F.softmax(scores, dim=1)

# Weighted sum of values
output = torch.matmul(weights, value)
print(output)
```
Attempts: 2 left
💡 Hint
Recall that scaled dot-product attention uses softmax on scaled scores to get weights.
✗ Incorrect
The query matches the first key exactly, so the attention weight on the first value is higher. The scores are divided by the square root of the key dimension (√3) before the softmax, and the output is the weighted sum of the values under those weights.
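The same computation can be verified by hand with NumPy (a sketch of the identical arithmetic; the ≈ values are rounded):

```python
import numpy as np

query = np.array([[1.0, 0.0, 1.0]])
key = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
value = np.array([[1.0, 2.0], [3.0, 4.0]])

# Raw dot products: query . k1 = 2, query . k2 = 0
scores = query @ key.T / np.sqrt(key.shape[1])  # ~ [[1.1547, 0]]

# Softmax over the two keys
exp = np.exp(scores)
weights = exp / exp.sum(axis=1, keepdims=True)  # ~ [[0.7604, 0.2396]]

output = weights @ value                        # ~ [[1.4793, 2.4793]]
print(output)
```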
🧠 Conceptual
Intermediate · 1:30 remaining
Purpose of the 'key' in attention mechanism
In the attention mechanism, what is the main role of the 'key' vectors?
Attempts: 2 left
💡 Hint
Think about how queries find relevant information in keys.
✗ Incorrect
Keys act like labels or addresses that queries compare against to find relevant information. The attention scores come from comparing queries to keys.
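This lookup role can be illustrated with a toy example (the vectors below are hypothetical, chosen so the query is close to the first key):

```python
import torch
import torch.nn.functional as F

# Hypothetical keys acting as "addresses" for two positions
key = torch.tensor([[1.0, 0.0],   # key for position 0
                    [0.0, 1.0]])  # key for position 1

# A query that "asks for" something close to the first key
query = torch.tensor([[0.9, 0.1]])

# Query-key similarity produces the attention scores
scores = query @ key.T            # tensor([[0.9, 0.1]])
weights = F.softmax(scores, dim=-1)
print(weights)                    # more weight on position 0
```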
❓ Hyperparameter
Advanced · 2:00 remaining
Effect of increasing number of attention heads
What is the main effect of increasing the number of attention heads in a multi-head attention model?
Attempts: 2 left
💡 Hint
Think about how multiple heads help the model see different aspects of the input.
✗ Incorrect
Multiple attention heads let the model focus on different parts or features of the input simultaneously, improving learning capacity.
❓ Metrics
Advanced · 1:30 remaining
Interpreting attention weights distribution
If an attention layer outputs very uniform attention weights across all positions, what does this typically indicate about the model's focus?
Attempts: 2 left
💡 Hint
Uniform weights mean no position stands out more than others.
✗ Incorrect
When attention weights are uniform, the model does not find any position more relevant than others, indicating uncertainty or weak signals.
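One way to quantify this "lack of focus" is the entropy of the weight distribution; uniform weights hit the maximum possible entropy, log(seq_len) (the peaked distribution below is a hypothetical comparison):

```python
import torch

seq_len = 4
uniform = torch.full((1, seq_len), 1.0 / seq_len)  # every position weighted equally
peaked = torch.tensor([[0.91, 0.03, 0.03, 0.03]])  # one position dominates

def entropy(w):
    # Higher entropy = less focused attention
    return -(w * w.log()).sum(dim=-1)

print(entropy(uniform))  # log(4) ~ 1.3863, the maximum for 4 positions
print(entropy(peaked))   # much lower: the head has a clear focus
```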
🔧 Debug
Expert · 2:30 remaining
Identifying error in custom attention implementation
Consider this simplified custom attention code snippet. What error will it raise when run?
```python
import torch
import torch.nn.functional as F

def custom_attention(query, key, value):
    scores = torch.matmul(query, key)  # shape mismatch possible
    weights = F.softmax(scores, dim=-1)
    output = torch.matmul(weights, value)
    return output

q = torch.randn(1, 3)
k = torch.randn(2, 3)
v = torch.randn(2, 4)
result = custom_attention(q, k, v)
print(result)
```
Attempts: 2 left
💡 Hint
Check the shapes of query and key tensors before multiplication.
✗ Incorrect
The query has shape (1, 3) and the key has shape (2, 3). Multiplying (1, 3) by (2, 3) raises a RuntimeError because the inner dimensions (3 and 2) do not match; the key must be transposed before the multiplication.
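One minimal fix is to transpose the key inside the function (a sketch, here also adding the sqrt(d_k) scaling used in the earlier snippet):

```python
import torch
import torch.nn.functional as F

def custom_attention(query, key, value):
    # Transpose key so (1, 3) @ (3, 2) -> (1, 2) attention scores
    scores = torch.matmul(query, key.T) / key.shape[1] ** 0.5  # scale by sqrt(d_k)
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, value)

q = torch.randn(1, 3)
k = torch.randn(2, 3)
v = torch.randn(2, 4)
result = custom_attention(q, k, v)
print(result.shape)  # torch.Size([1, 4])
```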