Challenge - 5 Problems
Self-Attention Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00
Output of scaled dot-product attention calculation
Given the following PyTorch code snippet implementing scaled dot-product attention, what is the output tensor?
PyTorch
import torch
import torch.nn.functional as F

query = torch.tensor([[1., 0., 0.]])              # shape (1, 3)
key = torch.tensor([[1., 0., 0.], [0., 1., 0.]])  # shape (2, 3)
value = torch.tensor([[1., 2.], [3., 4.]])        # shape (2, 2)

# Compute attention scores
scores = torch.matmul(query, key.T) / (3 ** 0.5)
weights = F.softmax(scores, dim=-1)
output = torch.matmul(weights, value)
print(output)
💡 Hint
Recall that softmax normalizes scores to probabilities, and output is weighted sum of values.
Explanation
The query matches the first key exactly, so the first key gets the larger attention weight, but not a weight near 1: after scaling, the scores are [1/√3, 0] ≈ [0.577, 0], and softmax turns these into weights of roughly [0.640, 0.360]. The output is the corresponding weighted sum of the value vectors, 0.640·[1., 2.] + 0.360·[3., 4.] ≈ [[1.719, 2.719]].
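The arithmetic behind those weights can be checked by hand; a minimal sketch reproducing the softmax step with plain Python:

```python
import math

# Scaled scores: query·key_i / sqrt(d) with d = 3
s1 = 1.0 / math.sqrt(3)   # query matches key 1 exactly -> dot product 1
s2 = 0.0 / math.sqrt(3)   # query is orthogonal to key 2 -> dot product 0

# Softmax over the two scores
z = math.exp(s1) + math.exp(s2)
w1, w2 = math.exp(s1) / z, math.exp(s2) / z   # ≈ 0.640, 0.360

# Weighted sum of the value vectors [1., 2.] and [3., 4.]
out = [w1 * 1. + w2 * 3., w1 * 2. + w2 * 4.]  # ≈ [1.719, 2.719]
print(w1, w2, out)
```

Note that even a "perfect" query–key match produces a soft weighting here, because the scores differ by only 0.577 before the softmax.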
❓ Model Choice
Intermediate · 1:30
Choosing the correct self-attention output shape
In a self-attention layer, if the input tensor has shape (batch_size=4, seq_len=10, embedding_dim=64), what will be the shape of the output tensor after applying multi-head self-attention with 8 heads and the same embedding dimension?
💡 Hint
Multi-head attention splits embedding_dim into heads but concatenates back to original embedding_dim.
Explanation
Multi-head attention splits the embedding dimension into 8 heads of size 8 each (64/8=8), processes attention in parallel, then concatenates back to shape (batch_size, seq_len, embedding_dim), which is (4, 10, 64).
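This can be verified with PyTorch's built-in layer; a minimal sketch using torch.nn.MultiheadAttention (batch_first=True matches the (batch, seq, embed) layout in the question):

```python
import torch
import torch.nn as nn

# Dimensions from the question: batch 4, sequence length 10, embedding 64, 8 heads
mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(4, 10, 64)

# Self-attention: query, key, and value are all the same tensor
out, attn_weights = mha(x, x, x)
print(out.shape)  # torch.Size([4, 10, 64]) — shape is preserved
```

The output shape matches the input shape because the per-head outputs (each of size 64/8 = 8) are concatenated back to the full embedding dimension.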
❓ Hyperparameter
Advanced · 1:30
Effect of increasing number of attention heads
What is the most likely effect of increasing the number of attention heads in a multi-head self-attention model while keeping the total embedding dimension fixed?
💡 Hint
Think about how embedding dimension is split among heads.
Explanation
Increasing the number of heads splits the fixed embedding dimension into smaller parts per head (head_dim = embed_dim / num_heads), allowing each head to attend to different aspects of the input and improving the diversity of attention patterns, at the cost of a lower-dimensional representation within each head.
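The split can be illustrated with a tensor reshape; a minimal sketch, assuming the (4, 10, 64) input from the previous question:

```python
import torch

batch, seq_len, embed_dim = 4, 10, 64
x = torch.randn(batch, seq_len, embed_dim)

for num_heads in (2, 4, 8, 16):
    head_dim = embed_dim // num_heads  # per-head dimension shrinks as heads grow
    # Reshape to (batch, num_heads, seq_len, head_dim) for parallel attention
    heads = x.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)
    print(num_heads, head_dim, tuple(heads.shape))
```

Total parameters and output shape stay the same in every case; only the granularity of each head's subspace changes.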
🔧 Debug
Advanced · 2:00
Identifying error in self-attention implementation
Consider this PyTorch code snippet for self-attention. What error will it raise when executed?
PyTorch
import torch
import torch.nn.functional as F

query = torch.randn(2, 5, 16)
key = torch.randn(2, 5, 16)
value = torch.randn(2, 5, 16)

scores = torch.matmul(query, key) / (16 ** 0.5)
weights = F.softmax(scores, dim=-1)
output = torch.matmul(weights, value)
print(output.shape)
💡 Hint
Check the dimensions of query and key for matrix multiplication.
Explanation
query and key both have shape (2, 5, 16). For a batched matmul, the last dimension of the first tensor (16) must equal the second-to-last dimension of the second (5); since 16 != 5, torch.matmul raises a RuntimeError. The fix is to transpose the last two dimensions of key, e.g. key.transpose(-2, -1).
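A sketch of the corrected snippet, with the key transposed so the inner dimensions line up:

```python
import torch
import torch.nn.functional as F

query = torch.randn(2, 5, 16)
key = torch.randn(2, 5, 16)
value = torch.randn(2, 5, 16)

# Transpose key's last two dims so the inner dimensions match:
# (2, 5, 16) @ (2, 16, 5) -> (2, 5, 5)
scores = torch.matmul(query, key.transpose(-2, -1)) / (16 ** 0.5)
weights = F.softmax(scores, dim=-1)  # one attention distribution per query position
output = torch.matmul(weights, value)
print(output.shape)  # torch.Size([2, 5, 16])
```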
🧠 Conceptual
Expert · 1:30
Why use scaled dot-product in self-attention?
Why do self-attention mechanisms scale the dot product of query and key vectors by the square root of their dimension?
💡 Hint
Think about how large dot products affect softmax.
Explanation
Without scaling, large dot products can push softmax into regions with very small gradients, making training harder. Scaling by sqrt(dim) keeps values in a range that stabilizes gradients.
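The effect is easy to demonstrate: for random vectors of dimension d, dot products have variance roughly d, so unscaled softmax saturates toward a one-hot distribution. A minimal sketch (the dimension 512 and the seed are arbitrary choices):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 512
q = torch.randn(d)       # one query vector
k = torch.randn(10, d)   # ten key vectors

raw = k @ q              # scores with std ~ sqrt(d) ≈ 22.6: softmax saturates
scaled = raw / d ** 0.5  # scores with std ~ 1: softmax stays soft

p_raw = F.softmax(raw, dim=-1)
p_scaled = F.softmax(scaled, dim=-1)
print(p_raw.max().item(), p_scaled.max().item())
```

The unscaled distribution concentrates nearly all its mass on one key, where softmax's gradients vanish; the scaled distribution spreads mass across keys and keeps gradients usable.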