Practice - 5 Tasks
Answer the questions below
Task 1: Fill in the blank (easy)
Complete the code to create a linear layer for query vectors in self-attention.
PyTorch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embed_size):
        super(SelfAttention, self).__init__()
        self.query = nn.Linear(embed_size, [1])
Common Mistakes
Using a different output size than embed_size causes shape mismatch errors.
Setting output size to 1 loses information needed for attention.
The query linear layer projects the input embedding to the same embedding size for self-attention.
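Filling the blank with embed_size, as the explanation states, gives the sketch below; the tensor sizes are illustrative, not part of the task:

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embed_size):
        super(SelfAttention, self).__init__()
        # Output size matches embed_size so queries keep the embedding shape.
        self.query = nn.Linear(embed_size, embed_size)

x = torch.randn(2, 4, 8)           # (batch, seq_len, embed_size), illustrative sizes
attn = SelfAttention(embed_size=8)
q = attn.query(x)
print(q.shape)                     # torch.Size([2, 4, 8])
```

Because nn.Linear acts on the last dimension, the query tensor keeps the same shape as the input.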
Task 2: Fill in the blank (medium)
Complete the code to compute the attention scores using matrix multiplication.
PyTorch
import torch

def forward(self, values, keys, queries):
    scores = torch.matmul(queries, [1].transpose(-2, -1))
    return scores
Common Mistakes
Multiplying queries with values instead of keys.
Not transposing keys before multiplication.
Attention scores are computed by multiplying queries with the transpose of keys.
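With keys in the blank, the score computation can be checked on its own; the shapes below are assumptions for illustration:

```python
import torch

queries = torch.randn(2, 4, 8)     # (batch, seq_len, d_k), illustrative sizes
keys = torch.randn(2, 4, 8)

# (2, 4, 8) @ (2, 8, 4) -> (2, 4, 4): one score per query/key pair
scores = torch.matmul(queries, keys.transpose(-2, -1))
print(scores.shape)                # torch.Size([2, 4, 4])
```

Transposing the last two dimensions of keys is what makes the matrix product line up: without it, the shapes would not match for matmul.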
Task 3: Fill in the blank (hard)
Fix the error in scaling the attention scores by the square root of the key dimension.
PyTorch
import torch
import math

def forward(self, values, keys, queries):
    d_k = keys.size(-1)
    scores = torch.matmul(queries, keys.transpose(-2, -1))
    scaled_scores = scores / math.sqrt([1])
    return scaled_scores
Common Mistakes
Using queries size instead of keys size for scaling.
Using batch size or wrong dimension for scaling.
The scaling factor is the square root of the key dimension, stored in d_k.
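Filling the blank with d_k, the scaling can be verified numerically; shapes are illustrative:

```python
import math
import torch

queries = torch.randn(2, 4, 8)
keys = torch.randn(2, 4, 8)

d_k = keys.size(-1)                # 8: the key dimension, not batch or seq_len
scores = torch.matmul(queries, keys.transpose(-2, -1))
scaled_scores = scores / math.sqrt(d_k)

# Multiplying back by sqrt(d_k) recovers the unscaled scores.
recovered = scaled_scores * math.sqrt(d_k)
```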
Task 4: Fill in the blank (hard)
Fill both blanks to apply softmax on scaled scores and multiply by values to get attention output.
PyTorch
import math
import torch
import torch.nn.functional as F

def forward(self, values, keys, queries):
    d_k = keys.size(-1)
    scores = torch.matmul(queries, keys.transpose(-2, -1)) / math.sqrt(d_k)
    attention = F.[1](scores, dim=[2])
    output = torch.matmul(attention, values)
    return output
Common Mistakes
Using sigmoid or relu instead of softmax for attention weights.
Applying softmax on wrong dimension.
Softmax is applied along the last dimension (dim=-1) to get attention weights.
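With softmax and dim=-1 in the blanks, each query's weights over the keys form a probability distribution; a standalone check, with illustrative shapes:

```python
import math
import torch
import torch.nn.functional as F

values = torch.randn(2, 4, 8)      # illustrative sizes
keys = torch.randn(2, 4, 8)
queries = torch.randn(2, 4, 8)

d_k = keys.size(-1)
scores = torch.matmul(queries, keys.transpose(-2, -1)) / math.sqrt(d_k)
attention = F.softmax(scores, dim=-1)      # weights over keys for each query
output = torch.matmul(attention, values)

print(output.shape)                # torch.Size([2, 4, 8])
```

Summing attention over its last dimension gives 1 for every query, which is why sigmoid or relu (whose outputs do not sum to 1) are the wrong choice here.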
Task 5: Fill in the blank (hard)
Fill all three blanks to complete a self-attention forward pass with query, key, value projections and output.
PyTorch
import math
import torch

def forward(self, x):
    queries = self.query(x)
    keys = self.key(x)
    values = self.value(x)
    d_k = keys.size(-1)
    scores = torch.matmul(queries, keys.transpose(-2, -1)) / math.sqrt([1])
    attention = torch.nn.functional.[2](scores, dim=[3])
    out = torch.matmul(attention, values)
    return out
Common Mistakes
Using embed_size instead of d_k for scaling.
Using wrong activation instead of softmax.
Applying softmax on wrong dimension.
Use d_k for scaling, softmax for attention weights, and dim=-1 for softmax dimension.
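Putting the three answers together (d_k, softmax, dim=-1) yields a complete single-head module; the key and value layers mirror the query layer from Task 1 and are an assumption of this sketch:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, embed_size):
        super(SelfAttention, self).__init__()
        # All three projections keep the embedding size.
        self.query = nn.Linear(embed_size, embed_size)
        self.key = nn.Linear(embed_size, embed_size)
        self.value = nn.Linear(embed_size, embed_size)

    def forward(self, x):
        queries = self.query(x)
        keys = self.key(x)
        values = self.value(x)
        d_k = keys.size(-1)
        scores = torch.matmul(queries, keys.transpose(-2, -1)) / math.sqrt(d_k)
        attention = F.softmax(scores, dim=-1)
        out = torch.matmul(attention, values)
        return out

x = torch.randn(2, 4, 8)           # (batch, seq_len, embed_size), illustrative sizes
out = SelfAttention(embed_size=8)(x)
print(out.shape)                   # torch.Size([2, 4, 8])
```

The output has the same shape as the input, which is what lets self-attention layers be stacked.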