Complete the code to compute the attention scores by multiplying queries and keys.
attention_scores = torch.matmul(queries, [1].transpose(-2, -1))
In self-attention, attention scores are computed by multiplying queries with the transpose of keys.
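A possible filled-in version of the snippet above, with the blank completed as `keys` per the explanation. The tensor shapes are illustrative assumptions, not part of the exercise:

```python
import torch

# Assumed illustrative shapes: batch=1, seq_len=3, d_k=4
queries = torch.randn(1, 3, 4)
keys = torch.randn(1, 3, 4)

# transpose(-2, -1) swaps the last two dims of keys so the matmul
# aligns: (1, 3, 4) @ (1, 4, 3) -> (1, 3, 3) score matrix
attention_scores = torch.matmul(queries, keys.transpose(-2, -1))
```

Each entry `[i, j]` of the score matrix is the dot product of query `i` with key `j`.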
Complete the code to scale the attention scores by the square root of the key dimension.
scaled_scores = attention_scores / math.sqrt([1])
Attention scores are scaled by the square root of the key dimension to stabilize gradients.
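A sketch of the scaling step with the blank filled as the key dimension `d_k`, as the explanation states. The shapes and the value of `d_k` are assumptions for illustration:

```python
import math
import torch

d_k = 4  # assumed key dimension
queries = torch.randn(1, 3, d_k)
keys = torch.randn(1, 3, d_k)
attention_scores = torch.matmul(queries, keys.transpose(-2, -1))

# Divide by sqrt(d_k) so dot products don't grow with dimension,
# which would otherwise push softmax into low-gradient regions
scaled_scores = attention_scores / math.sqrt(d_k)
```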
Fix the error in applying softmax to the attention scores along the correct dimension.
attention_weights = torch.nn.functional.softmax(attention_scores, dim=[1])
Softmax should be applied along the last dimension (keys dimension) to get proper attention weights.
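A minimal sketch of the corrected call, with the blank filled as `dim=-1` per the explanation (input shape is an illustrative assumption):

```python
import torch

# Assumed score matrix of shape (batch=1, queries=3, keys=3)
attention_scores = torch.randn(1, 3, 3)

# dim=-1 normalizes over the keys axis, so each query's
# attention weights over all keys sum to 1
attention_weights = torch.nn.functional.softmax(attention_scores, dim=-1)
```

Applying softmax over the wrong axis (e.g. the query axis) would normalize each key's column instead, which is not a valid attention distribution per query.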
Fill both blanks to compute multi-head attention output by concatenating heads and applying a linear layer.
multihead_output = self.linear_out(torch.cat([1], dim=[2]))
Multi-head outputs are concatenated along the last dimension before passing through a linear layer.
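A possible filled-in version: the first blank is the list of per-head outputs and the second is `dim=-1`, per the explanation. The head count, dimensions, and the standalone `linear_out` layer are assumptions standing in for the exercise's `self.linear_out`:

```python
import torch

# Assumed setup: 2 heads, each producing (batch=1, seq=3, head_dim=4)
head_outputs = [torch.randn(1, 3, 4) for _ in range(2)]

# Assumed output projection: input size 2 heads * 4 dims = 8
linear_out = torch.nn.Linear(8, 8)

# Concatenate heads along the last (feature) dimension, then project
concatenated = torch.cat(head_outputs, dim=-1)  # (1, 3, 8)
multihead_output = linear_out(concatenated)     # (1, 3, 8)
```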
Fill all three blanks to implement scaled dot-product attention: compute scores, apply softmax, and multiply by values.
scores = torch.matmul(queries, [1].transpose(-2, -1)) / math.sqrt([2])
weights = torch.nn.functional.softmax(scores, dim=[3])
output = torch.matmul(weights, values)
Scaled dot-product attention multiplies queries by keys, scales by sqrt of key dimension, applies softmax along last dimension, then multiplies by values.
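The three blanks filled per the explanation (`keys`, the key dimension, `dim=-1`), wrapped in a small function; shapes are illustrative assumptions:

```python
import math
import torch

def scaled_dot_product_attention(queries, keys, values):
    d_k = keys.size(-1)
    # 1. Raw scores: queries times transposed keys, scaled by sqrt(d_k)
    scores = torch.matmul(queries, keys.transpose(-2, -1)) / math.sqrt(d_k)
    # 2. Normalize over the keys axis (last dimension)
    weights = torch.nn.functional.softmax(scores, dim=-1)
    # 3. Weighted sum of values
    return torch.matmul(weights, values)

# Assumed shapes: batch=1, seq_len=3, d_k = d_v = 4
q = torch.randn(1, 3, 4)
k = torch.randn(1, 3, 4)
v = torch.randn(1, 3, 4)
output = scaled_dot_product_attention(q, k, v)  # (1, 3, 4)
```

Each output row is a convex combination of value rows, with weights given by that query's attention distribution.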