Practice - 5 Tasks
Answer the questions below
1. Fill in the blank (easy)
Complete the code to create a multi-head attention layer with 8 heads.

PyTorch
import torch.nn as nn

multihead_attn = nn.MultiheadAttention(embed_dim=64, num_heads=[1])
Common Mistakes
Choosing a number of heads that does not divide the embedding dimension evenly.
Using too few or too many heads without considering model size.
Explanation
The number of heads in multi-head attention is set by the num_heads parameter. Here, 8 is the correct choice because embed_dim=64 divides evenly into 8 heads of 8 dimensions each.
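Filled in, the layer from this task can be sketched as follows, with an explicit check for the divisibility requirement noted under Common Mistakes:

```python
import torch.nn as nn

embed_dim = 64
num_heads = 8
# num_heads must divide embed_dim evenly; each head then works on
# embed_dim // num_heads = 8 dimensions
assert embed_dim % num_heads == 0

multihead_attn = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads)
```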
2. Fill in the blank (medium)
Complete the code to apply multi-head attention on query, key, and value tensors.

PyTorch
output, attn_weights = multihead_attn([1], key, value)
Common Mistakes
Passing key or value as the first argument instead of query.
Confusing the order of inputs.
Explanation
The first argument to multihead_attn is the query tensor, which is used to compute attention scores.
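As a self-contained sketch of the full call with this argument order (the tensor shapes are illustrative, using PyTorch's default (seq_len, batch, embed_dim) layout):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
multihead_attn = nn.MultiheadAttention(embed_dim=64, num_heads=8)

# Default layout is (seq_len, batch, embed_dim); the query may have a
# different sequence length than the key/value pair.
query = torch.randn(5, 2, 64)
key = torch.randn(7, 2, 64)
value = torch.randn(7, 2, 64)

# query comes first; it determines the output's sequence length
output, attn_weights = multihead_attn(query, key, value)
# output: (5, 2, 64); attn_weights (averaged over heads by default):
# (batch, query_len, key_len) = (2, 5, 7)
```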
3. Fill in the blank (hard)
Fix the error in the code to correctly reshape the output of multi-head attention.

PyTorch
batch_size, seq_len, embed_dim = x.size()
output = output.transpose(0, 1).reshape([1], seq_len, embed_dim)
Common Mistakes
Using seq_len instead of batch_size in reshape.
Not transposing before reshaping.
Explanation
The output tensor from multihead_attn has shape (seq_len, batch_size, embed_dim). Transposing dims 0 and 1 swaps seq_len and batch_size, so reshaping should use batch_size as the first dimension.
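A runnable sketch of the corrected reshape, where a random tensor stands in for the attention output and the shapes are illustrative:

```python
import torch

seq_len, batch_size, embed_dim = 5, 2, 64
# multihead_attn returns output shaped (seq_len, batch_size, embed_dim)
output = torch.randn(seq_len, batch_size, embed_dim)

# Swap dims 0 and 1 first, then reshape with batch_size as the leading dim
output = output.transpose(0, 1).reshape(batch_size, seq_len, embed_dim)
```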
4. Fill in the blank (hard)
Fill both blanks to create a mask that prevents attention to future tokens.

PyTorch
import torch

seq_len = 5
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=[1]) == [2]
Common Mistakes
Using diagonal=0 which masks the main diagonal too.
Comparing to True instead of 0.
Explanation
To mask future tokens, we use torch.triu with diagonal=1 to select the upper triangle strictly above the main diagonal. Comparing to 0 then creates a boolean mask where True marks the attendable positions (the current token and earlier ones).
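Completed, the mask looks like this (note that if you pass a boolean attn_mask to nn.MultiheadAttention, PyTorch uses the opposite convention there, where True marks positions that may NOT be attended, so this mask would be inverted first):

```python
import torch

seq_len = 5
# triu with diagonal=1 keeps ones strictly above the main diagonal
# (the future positions); comparing to 0 flips it so that
# True = "may attend" (self and past positions).
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1) == 0
```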
5. Fill in the blank (hard)
Fill all three blanks to compute scaled dot-product attention manually.

PyTorch
import torch
import torch.nn.functional as F
import math

d_k = query.size(-1)
scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt([1])
attn = F.softmax(scores, dim=[2])
output = torch.matmul(attn, [3])
Common Mistakes
Using query instead of value in the last multiplication.
Applying softmax over wrong dimension.
Forgetting to scale scores.
Explanation
Scaled dot-product attention divides the scores by sqrt(d_k), applies softmax over the last dimension (dim=-1), then multiplies by value to get the output.
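Putting the three filled blanks together, a minimal sketch (the function wrapper and tensor shapes are illustrative, not part of the exercise):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    d_k = query.size(-1)  # dimension of the query/key vectors
    # Q·K^T scaled by sqrt(d_k) to keep score magnitudes stable
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    # Softmax over the last dim normalizes weights across key positions
    attn = F.softmax(scores, dim=-1)
    # Weighted sum of the value vectors
    return torch.matmul(attn, value), attn

torch.manual_seed(0)
query = torch.randn(2, 5, 64)
key = torch.randn(2, 7, 64)
value = torch.randn(2, 7, 64)
output, attn = scaled_dot_product_attention(query, key, value)
# output: (2, 5, 64); each row of attn sums to 1
```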