The attention mechanism helps the model weigh different parts of the input differently for each output, allowing it to focus on relevant information dynamically.
```python
import torch

batch_size = 2
seq_len_q = 3   # number of query positions
seq_len_k = 4   # number of key positions
d_k = 5         # dimensionality of queries and keys

Q = torch.randn(batch_size, seq_len_q, d_k)
K = torch.randn(batch_size, seq_len_k, d_k)

# Batched matrix multiply: (B, Lq, d_k) x (B, d_k, Lk) -> (B, Lq, Lk)
attention_scores = torch.bmm(Q, K.transpose(1, 2))
print(attention_scores.shape)  # torch.Size([2, 3, 4])
```
The batch matrix multiplication of Q (batch_size, seq_len_q, d_k) and Kᵀ (batch_size, d_k, seq_len_k) results in (batch_size, seq_len_q, seq_len_k).
The Transformer model introduced scaled dot-product attention to efficiently compute attention scores and improve training stability.
Scaling by 1/√d_k keeps the dot products at a scale that prevents softmax from saturating, which helps maintain meaningful gradients during training.
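Extending the earlier example, the full scaled dot-product attention can be sketched as follows: scale the raw scores by 1/√d_k, normalize with softmax, then take a weighted sum of value vectors V (the shapes here are illustrative, not from any particular model).

```python
import math
import torch
import torch.nn.functional as F

# Illustrative shapes (chosen arbitrarily for the sketch).
batch_size, seq_len_q, seq_len_k, d_k = 2, 3, 4, 5
Q = torch.randn(batch_size, seq_len_q, d_k)
K = torch.randn(batch_size, seq_len_k, d_k)
V = torch.randn(batch_size, seq_len_k, d_k)

# Raw scores scaled by 1/sqrt(d_k) before the softmax, as described above.
scores = torch.bmm(Q, K.transpose(1, 2)) / math.sqrt(d_k)
weights = F.softmax(scores, dim=-1)  # each row sums to 1 over the keys
output = torch.bmm(weights, V)       # weighted sum of value vectors

print(weights.shape)  # torch.Size([2, 3, 4])
print(output.shape)   # torch.Size([2, 3, 5])
```

Because the softmax is applied over the key dimension, each query position receives a convex combination of the value vectors, so the output has the same sequence length as the queries.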
While attention weights highlight important input parts, they do not always align perfectly with the model's true decision process, so interpretability is limited.