PyTorch · ~20 mins

Self-attention mechanism in PyTorch - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output
intermediate
Output of scaled dot-product attention calculation
Given the following PyTorch code snippet implementing scaled dot-product attention, what is the output tensor?
PyTorch
import torch
import torch.nn.functional as F

query = torch.tensor([[1., 0., 0.]])  # shape (1, 3)
key = torch.tensor([[1., 0., 0.], [0., 1., 0.]])  # shape (2, 3)
value = torch.tensor([[1., 2.], [3., 4.]])  # shape (2, 2)

# Compute attention scores
scores = torch.matmul(query, key.T) / (3 ** 0.5)
weights = F.softmax(scores, dim=-1)
output = torch.matmul(weights, value)
print(output)
A. [[1.9999, 2.9999]]
B. [[3.0, 4.0]]
C. [[1.719, 2.719]]
D. [[1.5, 2.5]]
💡 Hint
Recall that softmax normalizes the scores into probabilities, and the output is the probability-weighted sum of the value rows.
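To check your answer after attempting the problem, the snippet's arithmetic can be reproduced by hand in plain Python (a stand-in for what the torch calls compute, so no torch install is needed):

```python
import math

# Hand-compute the snippet's result: one query attends over two keys,
# and only the first key matches the query exactly.
q = [1.0, 0.0, 0.0]
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
d = 3  # key dimension used for the sqrt(d) scaling

# Scaled dot-product scores: q . k / sqrt(d)
scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]

# Softmax turns the scores into probabilities that sum to 1
exps = [math.exp(s) for s in scores]
total = sum(exps)
weights = [e / total for e in exps]

# The output is the attention-weighted sum of the value rows
output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(2)]
print([round(x, 3) for x in output])  # -> [1.719, 2.719]
```

Because the first key matches the query, its weight (about 0.64) pulls the output toward the first value row, giving the answer in option C.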
Model Choice
intermediate
Choosing the correct self-attention output shape
In a self-attention layer, if the input tensor has shape (batch_size=4, seq_len=10, embedding_dim=64), what will be the shape of the output tensor after applying multi-head self-attention with 8 heads and the same embedding dimension?
A. (4, 10, 512)
B. (4, 10, 64)
C. (4, 8, 10, 8)
D. (4, 64, 10)
💡 Hint
Multi-head attention splits embedding_dim into heads but concatenates back to original embedding_dim.
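The shape bookkeeping behind the hint can be sketched with plain Python tuples (no tensors needed; the intermediate (batch, heads, seq, head_dim) layout is the conventional one, though implementations vary):

```python
# Shape bookkeeping for multi-head self-attention
batch_size, seq_len, embed_dim, num_heads = 4, 10, 64, 8
head_dim = embed_dim // num_heads  # 64 / 8 = 8 dims per head

# After splitting into heads: (batch, heads, seq, head_dim)
per_head_shape = (batch_size, num_heads, seq_len, head_dim)

# Each head's output keeps (seq_len, head_dim); concatenating the 8 heads
# along the last axis restores the original embedding dimension.
output_shape = (batch_size, seq_len, num_heads * head_dim)
print(per_head_shape)  # (4, 8, 10, 8)
print(output_shape)    # (4, 10, 64)
```

The split-and-concatenate round trip is why self-attention layers preserve the input shape.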
Hyperparameter
advanced
Effect of increasing number of attention heads
What is the most likely effect of increasing the number of attention heads in a multi-head self-attention model while keeping the total embedding dimension fixed?
A. The model's total capacity increases linearly with the number of heads.
B. The model becomes unable to learn because the embedding dimension is fixed.
C. The embedding dimension per head increases, improving feature representation.
D. Each head attends to a smaller subspace, potentially capturing more diverse features.
💡 Hint
Think about how embedding dimension is split among heads.
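The trade-off in the hint is just integer arithmetic: with the total embedding dimension held fixed, adding heads shrinks each head's subspace. An illustrative sketch:

```python
# With a fixed total embedding dimension, each head gets a smaller slice
embed_dim = 64
dims_per_head = {h: embed_dim // h for h in (1, 2, 4, 8, 16)}
for num_heads, head_dim in dims_per_head.items():
    print(f"{num_heads:>2} heads -> {head_dim:>2} dims per head")
```

More heads means each head projects into a smaller subspace, which can capture more diverse attention patterns, but the total capacity does not grow with the head count.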
🔧 Debug
advanced
Identifying error in self-attention implementation
Consider this PyTorch code snippet for self-attention. What error will it raise when executed?
PyTorch
import torch
import torch.nn.functional as F

query = torch.randn(2, 5, 16)
key = torch.randn(2, 5, 16)
value = torch.randn(2, 5, 16)

scores = torch.matmul(query, key) / (16 ** 0.5)
weights = F.softmax(scores, dim=-1)
output = torch.matmul(weights, value)
print(output.shape)
A. RuntimeError due to shape mismatch in torch.matmul(query, key)
B. RuntimeError due to shape mismatch in torch.matmul(weights, value)
C. No error, prints torch.Size([2, 5, 16])
D. SyntaxError due to missing colon
💡 Hint
Check the dimensions of query and key for matrix multiplication.
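For 3-D tensors, torch.matmul batches over the first dimension and requires the last dim of the first operand to match the second-to-last dim of the other. A small pure-Python shape checker (an assumed stand-in for torch's rule) shows why the snippet fails and how the usual key.transpose(-2, -1) fix restores compatible shapes:

```python
def batched_matmul_shape(a, b):
    """Shape rule for matmul on 3-D tensors: (B, n, k) @ (B, k, m) -> (B, n, m)."""
    if a[0] != b[0] or a[2] != b[1]:
        raise RuntimeError(f"shape mismatch: {a} @ {b}")
    return (a[0], a[1], b[2])

q = k = v = (2, 5, 16)

# torch.matmul(query, key): inner dims 16 vs 5 do not line up
try:
    batched_matmul_shape(q, k)
except RuntimeError as e:
    print("fails:", e)

# The fix: transpose the last two dims of key, as key.transpose(-2, -1) would
k_t = (2, 16, 5)
scores_shape = batched_matmul_shape(q, k_t)            # (2, 5, 5)
output_shape = batched_matmul_shape(scores_shape, v)   # (2, 5, 16)
print(output_shape)
```

With the transpose in place, the attention scores form a (seq_len, seq_len) matrix per batch element, and the final output recovers the input shape.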
🧠 Conceptual
expert
Why use scaled dot-product in self-attention?
Why do self-attention mechanisms scale the dot product of query and key vectors by the square root of their dimension?
A. To prevent the dot products from growing too large, which can cause softmax gradients to vanish.
B. To increase the magnitude of dot products, making attention weights sharper.
C. To normalize the query and key vectors to unit length before dot product.
D. To reduce computational cost by scaling down the vectors.
💡 Hint
Think about how large dot products affect softmax.
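The effect in the hint can be seen numerically with a plain-Python softmax (the example values are illustrative: for unit-variance query and key entries, a dot product over d = 64 dims has variance d, so a typical unscaled score is around sqrt(64) = 8):

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

d = 64
raw = [8.0, 0.0, 0.0]                       # typical unscaled scores
scaled = [x / math.sqrt(d) for x in raw]    # [1.0, 0.0, 0.0] after / sqrt(d)

print([round(w, 4) for w in softmax(raw)])     # nearly one-hot
print([round(w, 4) for w in softmax(scaled)])  # much softer distribution
```

The unscaled softmax is almost one-hot, and in that saturated regime its gradients are near zero; dividing by sqrt(d) keeps the distribution soft enough for gradients to flow, which is option A.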