Challenge - 5 Problems
Multi-head Attention Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00
Output shape of multi-head attention
Given the following PyTorch code snippet using multi-head attention, what is the shape of the output tensor?
PyTorch
import torch
import torch.nn as nn

batch_size = 2
seq_len = 5
embed_dim = 16
num_heads = 4

mha = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads)
query = torch.rand(seq_len, batch_size, embed_dim)
key = torch.rand(seq_len, batch_size, embed_dim)
value = torch.rand(seq_len, batch_size, embed_dim)
out, _ = mha(query, key, value)
print(out.shape)
💡 Hint
Remember that by default PyTorch's MultiheadAttention expects input shape (seq_len, batch_size, embed_dim) and outputs the same shape.
✅ Explanation
By default, the MultiheadAttention module in PyTorch takes inputs of shape (seq_len, batch_size, embed_dim) and returns an output of the same shape, so the output shape is (5, 2, 16).
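As a side note (not part of the original snippet), the same module also accepts batch-first inputs via its `batch_first` argument; a minimal sketch:

```python
import torch
import torch.nn as nn

# With batch_first=True the module expects (batch_size, seq_len, embed_dim) instead.
mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
query = torch.rand(2, 5, 16)  # (batch_size, seq_len, embed_dim)
out, _ = mha(query, query, query)  # self-attention: same tensor as q, k, and v
print(out.shape)  # torch.Size([2, 5, 16])
```

Either way, the output shape always matches the query's shape.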
❓ Model Choice
Intermediate · 1:30
Choosing number of heads in multi-head attention
You have an embedding dimension of 64 for your model. Which choice of number of heads is valid for PyTorch's MultiheadAttention module?
💡 Hint
The embedding dimension must be divisible by the number of heads.
✅ Explanation
The number of heads must divide the embedding dimension evenly: 64 / 8 = 8, an integer, so 8 heads is valid. 7, 9, and 10 do not divide 64 evenly.
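The divisibility rule is easy to check with a few lines of plain Python (a sketch using the head counts from this question):

```python
embed_dim = 64
for num_heads in (7, 8, 9, 10):
    if embed_dim % num_heads == 0:
        # Each head gets an even slice of the embedding.
        print(f"{num_heads} heads: valid (head_dim = {embed_dim // num_heads})")
    else:
        print(f"{num_heads} heads: invalid ({embed_dim} % {num_heads} = {embed_dim % num_heads})")
```

PyTorch enforces the same check at construction time, raising an error for invalid head counts.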
❓ Hyperparameter
Advanced · 2:00
Effect of increasing number of heads on model capacity
What is the most accurate effect of increasing the number of heads in a multi-head attention layer while keeping the embedding dimension fixed?
💡 Hint
Think about how multi-head attention splits the embedding dimension into multiple heads.
✅ Explanation
Increasing the number of heads splits the fixed embedding dimension into more, smaller subspaces (head_dim = embed_dim / num_heads), allowing the model to attend to different information in parallel without increasing the total embedding size.
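One way to see this (a sketch, not part of the original challenge) is that the module's parameter count stays the same regardless of the number of heads, since each head just receives a slice of size embed_dim // num_heads:

```python
import torch.nn as nn

embed_dim = 64
for num_heads in (1, 4, 8):
    mha = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads)
    # The projection matrices are sized by embed_dim, not per-head,
    # so the total parameter count is identical for every head count.
    n_params = sum(p.numel() for p in mha.parameters())
    print(f"{num_heads} heads: head_dim={embed_dim // num_heads}, params={n_params}")
```

More heads means more parallel attention patterns, but each head works in a lower-dimensional subspace.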
🔧 Debug
Advanced · 2:00
Identifying error in multi-head attention input shapes
What error will this PyTorch code raise when running the multi-head attention layer?
PyTorch
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=32, num_heads=4)
query = torch.rand(10, 3, 32)
key = torch.rand(10, 3, 16)   # embedding dim 16 does not match embed_dim=32
value = torch.rand(10, 3, 32)
out, _ = mha(query, key, value)
💡 Hint
Check the embedding dimension of key and query tensors.
✅ Explanation
The key tensor has embedding dimension 16, while the query has 32. With the default settings (kdim = embed_dim), the key's last dimension must match embed_dim, so the forward pass raises a shape error.
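Two possible fixes (a sketch, assuming a recent PyTorch version): either give the key the same last dimension as the query, or declare the differing dimensions explicitly through the module's kdim/vdim arguments:

```python
import torch
import torch.nn as nn

# Fix 1: make the key's embedding dimension match embed_dim.
mha = nn.MultiheadAttention(embed_dim=32, num_heads=4)
query = torch.rand(10, 3, 32)
key = torch.rand(10, 3, 32)
value = torch.rand(10, 3, 32)
out, _ = mha(query, key, value)
print(out.shape)  # torch.Size([10, 3, 32])

# Fix 2: if the key genuinely has a different dimension, declare it via kdim.
mha2 = nn.MultiheadAttention(embed_dim=32, num_heads=4, kdim=16, vdim=32)
key16 = torch.rand(10, 3, 16)
out2, _ = mha2(query, key16, value)
print(out2.shape)  # torch.Size([10, 3, 32])
```

With kdim/vdim set, the module uses separate projection matrices per input, and the output shape still follows the query.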
🧠 Conceptual
Expert · 2:30
Why use multi-head attention instead of single-head attention?
Which of the following best explains the main advantage of multi-head attention over single-head attention?
💡 Hint
Think about how splitting into multiple heads affects the model's focus.
✅ Explanation
Multi-head attention splits the embedding into multiple heads, each of which can learn to attend to different aspects of the input, improving representational power over a single attention distribution.
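To see the heads operating in parallel, recent PyTorch versions can return one attention map per head (a sketch, not part of the original question, using `average_attn_weights=False`):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
x = torch.rand(2, 5, 16)
# average_attn_weights=False returns a separate attention map for each head
# instead of the default per-head average.
out, attn = mha(x, x, x, need_weights=True, average_attn_weights=False)
print(attn.shape)  # torch.Size([2, 4, 5, 5]) -> (batch, num_heads, tgt_len, src_len)
```

Each of the four 5×5 maps is an independent attention distribution, which is exactly what a single-head layer cannot provide.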