Recall & Review
beginner
What is the main idea behind multi-head attention?
Multi-head attention splits the attention mechanism into several smaller parts (heads) that run in parallel. Each head learns different relationships in the data, and their results are combined to capture richer information.
beginner
In multi-head attention, what are the Query, Key, and Value?
Query, Key, and Value are three sets of vectors derived from the input data. The Query is what we want to find information about, the Key helps match the Query, and the Value holds the actual information to be gathered.
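To make this concrete, here is a minimal sketch of how Q, K, and V are typically derived from the same input through learned linear maps. The projection layers `w_q`, `w_k`, and `w_v` are hypothetical names for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim = 8
x = torch.randn(3, embed_dim)  # 3 tokens, each an 8-dim embedding

# Hypothetical projections: each derives one of Q, K, V from the same input.
w_q = nn.Linear(embed_dim, embed_dim, bias=False)
w_k = nn.Linear(embed_dim, embed_dim, bias=False)
w_v = nn.Linear(embed_dim, embed_dim, bias=False)

Q, K, V = w_q(x), w_k(x), w_v(x)
print(Q.shape, K.shape, V.shape)  # each is (3, 8)
```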
intermediate
Why do we use multiple heads instead of one in attention?
Using multiple heads allows the model to focus on different parts or aspects of the input simultaneously. This helps the model understand complex patterns better than a single attention head.
intermediate
What is the shape of the output from a multi-head attention layer in PyTorch?
With the default batch_first=False, the output shape is (sequence_length, batch_size, embedding_dim); with batch_first=True the first two dimensions swap to (batch_size, sequence_length, embedding_dim). In both cases embedding_dim matches the original input dimension after all heads are combined.
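A quick check of this shape with torch.nn.MultiheadAttention, using self-attention on a random tensor (the sizes here are arbitrary):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=16, num_heads=4)  # batch_first=False by default
x = torch.randn(5, 2, 16)  # (sequence_length, batch_size, embedding_dim)

# Self-attention: the same tensor serves as query, key, and value.
out, attn_weights = mha(x, x, x)
print(out.shape)           # torch.Size([5, 2, 16]) — same shape as the input
print(attn_weights.shape)  # torch.Size([2, 5, 5]) — weights averaged over heads
```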
intermediate
How is the scaled dot-product attention computed inside each head?
Scaled dot-product attention is computed by taking the dot product of Query and Key, dividing by the square root of the key dimension to scale, applying softmax to get weights, and then multiplying by the Value vectors.
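The steps above can be sketched directly. The function name `scaled_dot_product_attention` and the small random tensors are our own choices for illustration:

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # scaled dot products
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V                                  # weighted sum of values

torch.manual_seed(0)
Q = torch.randn(3, 4)
K = torch.randn(3, 4)
V = torch.randn(3, 4)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # torch.Size([3, 4])
```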
What does each head in multi-head attention learn?
Each head focuses on different parts or aspects of the input to capture diverse information.
What is the purpose of scaling the dot product in scaled dot-product attention?
Without scaling, the variance of the dot products grows with the key dimension, which pushes softmax into a saturated, nearly one-hot regime where gradients are vanishingly small. Dividing by the square root of the key dimension keeps the dot products in a well-behaved range.
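A small illustration of why the scaling matters (the sizes are arbitrary): dot products of high-dimensional random vectors have large variance, so the unscaled softmax tends toward a near-one-hot distribution, while the scaled version stays softer:

```python
import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(d_k)
k = torch.randn(10, d_k)

raw = k @ q                  # dot products; variance grows with d_k
scaled = raw / d_k ** 0.5    # variance brought back to roughly 1

print(torch.softmax(raw, dim=0).max())     # typically near 1: almost one-hot
print(torch.softmax(scaled, dim=0).max())  # a softer distribution
```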
In PyTorch, which class implements multi-head attention?
PyTorch provides torch.nn.MultiheadAttention for multi-head attention implementation.
What is combined after all attention heads compute their outputs?
Outputs from all heads are concatenated and then passed through a linear layer to match the original embedding size.
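A minimal sketch of this combine step, assuming four hypothetical per-head outputs of 8 dimensions each:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_heads, head_dim = 4, 8
embed_dim = num_heads * head_dim  # 32

# Stand-ins for per-head outputs: each is (sequence_length, head_dim).
head_outputs = [torch.randn(5, head_dim) for _ in range(num_heads)]

concat = torch.cat(head_outputs, dim=-1)    # (5, 32): heads placed side by side
out_proj = nn.Linear(embed_dim, embed_dim)  # final linear layer mixes the heads
out = out_proj(concat)
print(out.shape)  # torch.Size([5, 32]) — back to the original embedding size
```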
Which of these is NOT a component of multi-head attention?
Query, Key, and Value are core components; bias is not a fundamental part of multi-head attention.
Explain how multi-head attention works and why it is useful in simple terms.
Think about how looking at something from different angles helps understand it better.
Describe the role of Query, Key, and Value in the attention mechanism.
Imagine searching for a book in a library using a catalog.