PyTorch · ~20 mins

Multi-head attention in PyTorch - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output (intermediate)
Output shape of multi-head attention
Given the following PyTorch code snippet using multi-head attention, what is the shape of the output tensor?
PyTorch
import torch
import torch.nn as nn

batch_size = 2
seq_len = 5
embed_dim = 16
num_heads = 4

mha = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads)
query = torch.rand(seq_len, batch_size, embed_dim)
key = torch.rand(seq_len, batch_size, embed_dim)
value = torch.rand(seq_len, batch_size, embed_dim)
out, _ = mha(query, key, value)
print(out.shape)
A. torch.Size([5, 2, 16])
B. torch.Size([2, 5, 16])
C. torch.Size([5, 16, 2])
D. torch.Size([2, 16, 5])
💡 Hint
Remember that PyTorch's MultiheadAttention expects input shape (sequence_length, batch_size, embedding_dim) and outputs the same shape.
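A quick way to verify the layout convention yourself: by default `nn.MultiheadAttention` is sequence-first, and passing `batch_first=True` switches to `(batch, seq, embed)`. A minimal sketch (shapes chosen to match the problem, not part of the original snippet):

```python
import torch
import torch.nn as nn

# Default layout: (seq_len, batch_size, embed_dim)
mha = nn.MultiheadAttention(embed_dim=16, num_heads=4)
x = torch.rand(5, 2, 16)
out, _ = mha(x, x, x)
# Output keeps the same layout as the input
assert out.shape == x.shape

# batch_first=True switches the layout to (batch_size, seq_len, embed_dim)
mha_bf = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
xb = torch.rand(2, 5, 16)
out_b, _ = mha_bf(xb, xb, xb)
assert out_b.shape == xb.shape
```

In both cases the output tensor has exactly the shape of the query tensor, which is the key to the question above.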
Model Choice (intermediate)
Choosing number of heads in multi-head attention
You have an embedding dimension of 64 for your model. Which choice of number of heads is valid for PyTorch's MultiheadAttention module?
A. 10
B. 7
C. 9
D. 8
💡 Hint
The embedding dimension must be divisible by the number of heads.
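The divisibility rule from the hint can be checked mechanically before constructing the module. A small sketch (the candidate head counts mirror the options above):

```python
embed_dim = 64

# nn.MultiheadAttention requires embed_dim % num_heads == 0;
# collect the head counts that satisfy it.
valid = [h for h in (7, 8, 9, 10) if embed_dim % h == 0]
print(valid)
```

Passing an invalid head count to `nn.MultiheadAttention` raises an error at construction time, so this check is worth doing before training starts.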
Hyperparameter (advanced)
Effect of increasing number of heads on model capacity
What is the most accurate effect of increasing the number of heads in a multi-head attention layer while keeping the embedding dimension fixed?
A. It increases the model's ability to focus on different parts of the input simultaneously without increasing the total embedding size.
B. It increases the embedding dimension, making the model larger and slower.
C. It decreases the model's ability to learn because each head gets fewer parameters.
D. It reduces the number of parameters by sharing weights across heads.
💡 Hint
Think about how multi-head attention splits the embedding dimension into multiple heads.
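The hint can be made concrete by counting parameters: each head operates on `embed_dim // num_heads` dimensions, so adding heads narrows each head rather than growing the model. A sketch (parameter counting is an illustration, not part of the original problem):

```python
import torch.nn as nn

embed_dim = 64

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# More heads => narrower heads; the total projection size,
# and hence the parameter count, stays the same.
param_counts = []
for num_heads in (2, 4, 8):
    mha = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads)
    print(num_heads, "head_dim =", embed_dim // num_heads,
          "params =", n_params(mha))
    param_counts.append(n_params(mha))
```

All three configurations report the same parameter count; only the per-head dimension changes.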
🔧 Debug (advanced)
Identifying error in multi-head attention input shapes
What error will this PyTorch code raise when running the multi-head attention layer?
PyTorch
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=32, num_heads=4)
query = torch.rand(10, 3, 32)
key = torch.rand(10, 3, 16)
value = torch.rand(10, 3, 32)
out, _ = mha(query, key, value)
A. TypeError: Expected 3D tensor for query
B. RuntimeError: The key and query must have the same embedding dimension
C. ValueError: Number of heads must divide embedding dimension
D. No error, code runs successfully
💡 Hint
Check the embedding dimension of key and query tensors.
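For reference, `nn.MultiheadAttention` *can* accept key/value tensors with a different feature size, but only if that is declared up front via the `kdim`/`vdim` constructor arguments. A sketch of the corrected version of the snippet above:

```python
import torch
import torch.nn as nn

# Declaring kdim=16 tells the module to project 16-dim keys
# up to the 32-dim attention space; the output keeps embed_dim features.
mha = nn.MultiheadAttention(embed_dim=32, num_heads=4, kdim=16, vdim=32)
query = torch.rand(10, 3, 32)
key = torch.rand(10, 3, 16)
value = torch.rand(10, 3, 32)
out, _ = mha(query, key, value)
print(out.shape)  # torch.Size([10, 3, 32])
```

Without `kdim`, the module assumes keys share the query's embedding dimension, which is why the original snippet fails.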
🧠 Conceptual (expert)
Why use multi-head attention instead of single-head attention?
Which of the following best explains the main advantage of multi-head attention over single-head attention?
A. Multi-head attention reduces the computational cost compared to single-head attention.
B. Multi-head attention always produces smaller output tensors than single-head attention.
C. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
D. Multi-head attention eliminates the need for positional encoding in sequence models.
💡 Hint
Think about how splitting into multiple heads affects the model's focus.
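One way to see the heads at work is to inspect the per-head attention maps. A sketch, assuming a PyTorch version where `forward` accepts `average_attn_weights` (added in 1.11):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
x = torch.rand(2, 5, 16)

# average_attn_weights=False returns one attention map per head
# instead of averaging them, so each head's focus is visible.
_, weights = mha(x, x, x, need_weights=True, average_attn_weights=False)
print(weights.shape)  # (batch, num_heads, tgt_len, src_len)
```

Each of the four maps is a separate distribution over positions, which is exactly the "different representation subspaces at different positions" idea.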