Practice - 5 Tasks
Answer the questions below
1. Fill in the blank (easy, NLP)

Complete the code to compute the attention scores using a dot product.
```python
import torch

query = torch.randn(1, 5)
key = torch.randn(1, 5)
attention_scores = torch.matmul(query, [1].T)
print(attention_scores)
```
Common Mistakes
Using query instead of key for the dot product.
Not transposing the key matrix before multiplication.
Explanation: The attention scores are computed as the dot product of the query with the transpose of the key vectors.
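For reference, a completed version of this snippet (a sketch assuming the blank is filled with `key`, as the explanation above indicates; the `manual_seed` call is added here only for reproducibility):

```python
import torch

torch.manual_seed(0)  # added only so the random tensors are reproducible
query = torch.randn(1, 5)
key = torch.randn(1, 5)

# Dot product of the query with the transposed key: (1, 5) @ (5, 1) -> (1, 1)
attention_scores = torch.matmul(query, key.T)
print(attention_scores.shape)  # torch.Size([1, 1])
```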
2. Fill in the blank (medium, NLP)

Complete the code to apply softmax to the attention scores to get attention weights.
```python
import torch
import torch.nn.functional as F

attention_scores = torch.tensor([[1.0, 2.0, 3.0]])
attention_weights = F.[1](attention_scores, dim=-1)
print(attention_weights)
```
Common Mistakes
Using sigmoid instead of softmax, which does not normalize across the dimension.
Applying activation functions like relu or tanh which do not produce probabilities.
Explanation: Softmax converts raw attention scores into probabilities that sum to 1.
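A completed version of this snippet, assuming the blank is `softmax` per the explanation above:

```python
import torch
import torch.nn.functional as F

attention_scores = torch.tensor([[1.0, 2.0, 3.0]])
# softmax exponentiates and normalizes along the last dimension
attention_weights = F.softmax(attention_scores, dim=-1)
print(attention_weights)        # three probabilities, largest for the score 3.0
print(attention_weights.sum())  # the weights sum to 1
```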
3. Fill in the blank (hard, NLP)

Fix the error in the code to compute the weighted sum of values using the attention weights.
```python
import torch

attention_weights = torch.tensor([[0.1, 0.7, 0.2]])
values = torch.tensor([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])
weighted_sum = torch.matmul([1], values)
print(weighted_sum)
```
Common Mistakes
Transposing attention weights causing shape mismatch.
Multiplying values by values instead of attention weights.
Explanation: The weighted sum is the matrix multiplication of the attention weights and the values.
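A completed version, assuming the blank is `attention_weights` as the explanation above states:

```python
import torch

attention_weights = torch.tensor([[0.1, 0.7, 0.2]])
values = torch.tensor([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])

# (1, 3) @ (3, 3) -> (1, 3): the output row is a weighted blend of the value rows
weighted_sum = torch.matmul(attention_weights, values)
# Because values is the identity matrix here, the result equals the weights themselves
print(weighted_sum)
```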
4. Fill in the blank (hard, NLP)

Fill both blanks to scale the attention scores and apply softmax.
```python
import torch
import torch.nn.functional as F

query = torch.randn(1, 64)
key = torch.randn(10, 64)
scores = torch.matmul(query, key.T) / [1]
attention_weights = F.[2](scores, dim=-1)
print(attention_weights)
```
Common Mistakes
Using sigmoid instead of softmax for attention weights.
Not scaling scores causing unstable training.
Explanation: Attention scores are scaled by the square root of the key dimension before softmax to stabilize gradients.
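A completed sketch, assuming the blanks are the square root of the key dimension and `softmax`, per the explanation above:

```python
import math

import torch
import torch.nn.functional as F

d_k = 64  # key dimension
query = torch.randn(1, d_k)
key = torch.randn(10, d_k)

# Scaling by sqrt(d_k) keeps the score variance near 1, avoiding a saturated softmax
scores = torch.matmul(query, key.T) / math.sqrt(d_k)
attention_weights = F.softmax(scores, dim=-1)
print(attention_weights.shape)  # torch.Size([1, 10])
```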
5. Fill in the blank (hard, NLP)

Fill in the blanks to implement multi-head attention output concatenation and projection.
```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.linear_out = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        batch_size, seq_len, embed_dim = x.size()
        # Assume x is already split into heads and attention applied
        concat_heads = x.reshape(batch_size, seq_len, [1])
        output = self.linear_out([2])
        return output

attention = MultiHeadAttention(embed_dim=128, num_heads=8)
x = torch.randn(2, 10, 128)
result = attention(x)
print(result.shape)
```
Common Mistakes
Using wrong dimension in reshape causing size mismatch.
Passing wrong variable to linear_out layer.
Explanation: The heads are concatenated back to the original embedding dimension before passing through the output linear layer.
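A completed sketch, assuming the blanks are `self.num_heads * self.head_dim` (which equals `embed_dim`) and `concat_heads`, per the explanation above:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.linear_out = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        batch_size, seq_len, embed_dim = x.size()
        # Assume x is already split into heads and attention applied;
        # concatenation restores num_heads * head_dim == embed_dim
        concat_heads = x.reshape(batch_size, seq_len, self.num_heads * self.head_dim)
        # The output projection mixes information across the heads
        return self.linear_out(concat_heads)

attention = MultiHeadAttention(embed_dim=128, num_heads=8)
x = torch.randn(2, 10, 128)
result = attention(x)
print(result.shape)  # torch.Size([2, 10, 128])
```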