Which of the following best explains why attention mechanisms improved deep learning models?
Think about how humans pay attention to important details when processing information.
Attention helps models weigh different parts of the input differently, allowing better understanding of context and relationships.
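A minimal sketch of the idea above (a toy example, not from the source): instead of averaging input features uniformly, attention combines them with learned weights, so important parts contribute more. The inputs and weights below are made-up illustrative numbers.

```python
# Toy illustration: attention re-weights inputs instead of averaging uniformly.
inputs = [0.9, 0.1, 0.8]          # hypothetical feature scores for three tokens
attn_weights = [0.7, 0.1, 0.2]    # hypothetical attention weights (sum to 1)

uniform_avg = sum(inputs) / len(inputs)                     # plain mean ≈ 0.6
attended = sum(w * x for w, x in zip(attn_weights, inputs)) # weighted sum ≈ 0.8

print(uniform_avg, attended)
```

The attended result leans toward the highly weighted first token, which is exactly the "weigh different parts differently" behavior described in the answer.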
What is the output tensor after computing scaled dot-product attention scores for the given query and key tensors?
import torch
import torch.nn.functional as F

query = torch.tensor([[1., 0., 1.]])                # shape (1, 3)
key = torch.tensor([[1., 0., 0.], [0., 1., 1.]])    # shape (2, 3)

# Scale the dot products by sqrt(d_k) = sqrt(3)
scores = torch.matmul(query, key.T) / (3 ** 0.5)
output = F.softmax(scores, dim=1)
print(output)
Recall that softmax converts scores into probabilities summing to 1.
Each query-key dot product equals 1, so both scaled scores are 1/sqrt(3); softmax of identical scores gives equal probabilities, so the output is tensor([[0.5, 0.5]]).
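The result of the question's snippet can be checked by hand in pure Python (no torch needed), following the same scaled dot-product steps:

```python
import math

# Manual check of the question above: q = [1, 0, 1], k1 = [1, 0, 0], k2 = [0, 1, 1].
q = [1.0, 0.0, 1.0]
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

d_k = len(q)
# Both dot products equal 1, so both scaled scores equal 1/sqrt(3).
scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]

# Softmax: identical scores yield equal probabilities.
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]
print(weights)  # [0.5, 0.5]
```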
Which model architecture best uses attention to handle long-range dependencies in sequences?
Think about which model can directly relate all parts of a sequence to each other.
Transformers use self-attention to connect all sequence positions, capturing long-range dependencies effectively.
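A brief sketch of this using PyTorch's built-in layer: in self-attention, query, key, and value are all the same sequence, so the attention weight matrix relates every position to every other in one step. The dimensions (embed_dim=8, num_heads=2, seq_len=5) are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x = torch.randn(1, 5, 8)  # (batch, seq_len, embed_dim)

# Self-attention: the same tensor serves as query, key, and value.
out, weights = attn(x, x, x)
print(out.shape)      # torch.Size([1, 5, 8])
print(weights.shape)  # torch.Size([1, 5, 5]): each position attends to all 5
```

The (5, 5) weight matrix is what lets position 0 attend directly to position 4 with no intervening recurrence, which is why long-range dependencies are easy to capture.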
What is the typical effect of increasing the number of attention heads in a multi-head attention layer?
Consider how multiple heads can look at different parts or aspects of the input.
Multiple attention heads let the model capture diverse features and relationships in the input data.
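A hedged sketch of the point above: in PyTorch's implementation the embedding is split across heads (head_dim = embed_dim // num_heads), so each head computes its own attention map over a smaller subspace while the output size stays fixed. The sizes below are illustrative.

```python
import torch
import torch.nn as nn

embed_dim = 16
x = torch.randn(1, 4, embed_dim)  # (batch, seq_len, embed_dim)

for num_heads in (1, 2, 4):
    attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
    # average_attn_weights=False returns one attention map per head
    out, w = attn(x, x, x, average_attn_weights=False)
    print(num_heads, out.shape, w.shape)  # w: (batch, num_heads, seq, seq)
```

More heads means more independent attention maps over the same sequence, which is how the heads can specialize on different relationships.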
Given a trained attention model, which metric best helps quantify how focused the attention distribution is on a few key inputs?
Think about a measure that shows how spread out or concentrated a probability distribution is.
Entropy measures uncertainty; lower entropy means attention is focused on fewer inputs, aiding explainability.
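The entropy measure above can be computed directly from an attention row. This is a small sketch with made-up weight vectors: a near-one-hot (focused) distribution has low entropy, while a uniform (diffuse) distribution attains the maximum, ln(n).

```python
import torch

def attention_entropy(weights, eps=1e-12):
    """Shannon entropy of a probability vector; eps guards against log(0)."""
    return -(weights * (weights + eps).log()).sum(dim=-1)

focused = torch.tensor([0.97, 0.01, 0.01, 0.01])  # nearly one-hot
diffuse = torch.tensor([0.25, 0.25, 0.25, 0.25])  # uniform over 4 inputs

print(attention_entropy(focused))  # low: mass concentrated on one input
print(attention_entropy(diffuse))  # maximum for 4 inputs: ln(4) ≈ 1.386
```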