PyTorch · ~10 mins

Multi-head attention in PyTorch - Interactive Code Practice

Practice - 5 Tasks
Answer the questions below
Task 1: Fill in the blank (easy)

Complete the code to create a multi-head attention layer with 8 heads.

PyTorch
import torch.nn as nn

multihead_attn = nn.MultiheadAttention(embed_dim=64, num_heads=[1])
A. 8
B. 4
C. 16
D. 32
Common Mistakes
Choosing a number of heads that does not divide the embedding dimension evenly.
Using too few or too many heads without considering model size.
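A runnable sketch of the completed snippet, assuming the intended answer is 8 heads (8 divides the embedding dimension of 64 evenly, giving each head 8 dimensions):

```python
import torch.nn as nn

# num_heads must divide embed_dim evenly: 64 // 8 = 8 dims per head
multihead_attn = nn.MultiheadAttention(embed_dim=64, num_heads=8)
```

By contrast, `num_heads=32` would still divide 64, but leaves only 2 dimensions per head, which is usually too few to be useful.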
Task 2: Fill in the blank (medium)

Complete the code to apply multi-head attention on query, key, and value tensors.

PyTorch
output, attn_weights = multihead_attn([1], key, value)
A. query
B. key
C. value
D. output
Common Mistakes
Passing key or value as the first argument instead of query.
Confusing the order of inputs.
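A sketch of the full call, assuming the intended answer is `query` as the first argument. The tensor shapes below are illustrative; with the default `batch_first=False`, inputs are `(seq_len, batch, embed_dim)`:

```python
import torch
import torch.nn as nn

multihead_attn = nn.MultiheadAttention(embed_dim=64, num_heads=8)

# Default batch_first=False: shape is (seq_len, batch, embed_dim)
query = torch.rand(5, 2, 64)
key = torch.rand(5, 2, 64)
value = torch.rand(5, 2, 64)

# Argument order matters: query first, then key, then value
output, attn_weights = multihead_attn(query, key, value)
```

`output` keeps the query's shape `(5, 2, 64)`; `attn_weights` is averaged over heads by default, giving `(batch, target_len, source_len)`.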
Task 3: Fill in the blank (hard)

Fix the error in the code to correctly reshape the output of multi-head attention.

PyTorch
batch_size, seq_len, embed_dim = x.size()
output = output.transpose(0, 1).reshape([1], seq_len, embed_dim)
A. seq_len
B. output.size(0)
C. embed_dim
D. batch_size
Common Mistakes
Using seq_len instead of batch_size in reshape.
Not transposing before reshaping.
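A sketch of the corrected reshape, assuming the intended answer is `batch_size` and that `output` came from attention with `batch_first=False`, i.e. shaped `(seq_len, batch, embed_dim)`. The concrete sizes are illustrative:

```python
import torch

batch_size, seq_len, embed_dim = 2, 5, 64

# Attention output with batch_first=False: (seq_len, batch, embed_dim)
output = torch.rand(seq_len, batch_size, embed_dim)

# transpose(0, 1) swaps to (batch, seq_len, embed_dim) first;
# the reshape must then lead with batch_size, not seq_len
output = output.transpose(0, 1).reshape(batch_size, seq_len, embed_dim)
```

Skipping the transpose would silently interleave tokens across batch elements even though the final shape looks right.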
Task 4: Fill in the blank (hard)

Fill both blanks to create a mask that prevents attention to future tokens.

PyTorch
import torch
seq_len = 5
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=[1]) == [2]
A. 1
B. 0
C. True
D. False
Common Mistakes
Using diagonal=0 which masks the main diagonal too.
Comparing to True instead of 0.
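A sketch of the completed mask, assuming the intended answers are `diagonal=1` and `== 0` per the common mistakes above. Under this convention `mask[i, j]` is `True` where position `i` is allowed to attend to position `j`; the `masked_fill` usage below is an illustrative way to apply it:

```python
import torch

seq_len = 5

# diagonal=1 excludes the main diagonal from the upper triangle,
# so each token can still attend to itself
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1) == 0
# mask[i, j] is True only for j <= i (past and present positions)

# Illustrative use: block the disallowed (future) positions
scores = torch.rand(seq_len, seq_len)
scores = scores.masked_fill(~mask, float('-inf'))
attn = torch.softmax(scores, dim=-1)
```

Note that `nn.MultiheadAttention`'s boolean `attn_mask` uses the opposite convention (`True` means "do not attend"), so the comparison direction depends on where the mask is consumed.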
Task 5: Fill in the blank (hard)

Fill all three blanks to compute scaled dot-product attention manually.

PyTorch
import torch
import torch.nn.functional as F
import math

d_k = query.size(-1)
scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt([1])
attn = F.softmax(scores, dim=[2])
output = torch.matmul(attn, [3])
A. d_k
B. -1
C. value
D. query
Common Mistakes
Using query instead of value in the last multiplication.
Applying softmax over wrong dimension.
Forgetting to scale scores.
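A runnable sketch of the full computation, assuming the intended answers are `d_k`, `-1`, and `value`. The input tensors are illustrative, shaped `(batch, seq_len, d_k)`:

```python
import math
import torch
import torch.nn.functional as F

# Illustrative inputs: (batch, seq_len, d_k)
query = torch.rand(2, 5, 64)
key = torch.rand(2, 5, 64)
value = torch.rand(2, 5, 64)

d_k = query.size(-1)
# Scale by sqrt(d_k) so dot products don't grow with head size
scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
# Softmax over the key dimension (last dim): each query's weights sum to 1
attn = F.softmax(scores, dim=-1)
# Weighted sum of VALUE vectors (using query here is the classic mistake)
output = torch.matmul(attn, value)
```

Without the `sqrt(d_k)` scaling, the softmax saturates for large head dimensions and gradients become tiny.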