PyTorch · ~5 mins

Multi-head attention in PyTorch - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is the main idea behind multi-head attention?
Multi-head attention splits the attention mechanism into several smaller parts (heads) that run in parallel. Each head learns different relationships in the data, and their results are combined to capture richer information.
beginner
In multi-head attention, what are the Query, Key, and Value?
Query, Key, and Value are three sets of vectors derived from the input data. The Query is what we want to find information about, the Key helps match the Query, and the Value holds the actual information to be gathered.
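The library-catalog analogy above can be made concrete with a few tensor ops. This is a minimal single-head sketch (the layer names `to_q`, `to_k`, `to_v` are made up for the example):

```python
import torch
import torch.nn as nn

embed_dim = 8
x = torch.randn(5, embed_dim)  # a toy sequence of 5 token embeddings

# Three separate linear projections derive Query, Key, and Value from the
# same input.
to_q = nn.Linear(embed_dim, embed_dim)
to_k = nn.Linear(embed_dim, embed_dim)
to_v = nn.Linear(embed_dim, embed_dim)
q, k, v = to_q(x), to_k(x), to_v(x)

scores = q @ k.T              # how well each Query matches each Key, (5, 5)
weights = scores.softmax(-1)  # each row sums to 1
out = weights @ v             # gather Values weighted by the match, (5, 8)
```

Each output row is a mixture of Value vectors, weighted by how well that token's Query matched every Key.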
intermediate
Why do we use multiple heads instead of one in attention?
Using multiple heads allows the model to focus on different parts or aspects of the input simultaneously. This helps the model understand complex patterns better than a single attention head.
intermediate
What is the shape of the output from a multi-head attention layer in PyTorch?
By default the output shape is (sequence_length, batch_size, embedding_dim); with batch_first=True it is (batch_size, sequence_length, embedding_dim). Either way, embedding_dim equals the input dimension, because the heads' outputs are concatenated and projected back to it.
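A quick way to check this shape with PyTorch's built-in layer (dimensions here are chosen arbitrarily for the example):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=16, num_heads=4)  # seq-first by default
x = torch.randn(10, 2, 16)  # (sequence_length, batch_size, embedding_dim)
out, attn_weights = mha(x, x, x)  # self-attention: x supplies Q, K, and V
print(out.shape)  # torch.Size([10, 2, 16]) -- same shape as the input
```

Passing `batch_first=True` to the constructor switches both input and output to (batch_size, sequence_length, embedding_dim).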
intermediate
How is the scaled dot-product attention computed inside each head?
Scaled dot-product attention is computed by taking the dot product of Query and Key, dividing by the square root of the key dimension to scale, applying softmax to get weights, and then multiplying by the Value vectors.
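The four steps just described translate almost line for line into code. A minimal sketch (newer PyTorch also ships this as `torch.nn.functional.scaled_dot_product_attention`):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # dot product, scaled
    weights = torch.softmax(scores, dim=-1)            # attention weights
    return weights @ v                                  # weighted sum of Values

q = k = v = torch.randn(5, 8)  # 5 tokens, key dimension 8
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([5, 8])
```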
What does each head in multi-head attention learn?
A) Different relationships or features from the input
B) The same information repeatedly
C) Only the first token of the input
D) Random noise
Answer: A
What is the purpose of scaling the dot product in scaled dot-product attention?
A) To normalize the output to zero mean
B) To prevent large values that can make softmax gradients too small
C) To increase the dot product values
D) To reduce the number of heads
Answer: B
In PyTorch, which class implements multi-head attention?
A) torch.nn.AttentionHead
B) torch.nn.AttentionLayer
C) torch.nn.MultiHeadLayer
D) torch.nn.MultiheadAttention
Answer: D
What is combined after all attention heads compute their outputs?
A) The outputs are concatenated and projected back to the original dimension
B) Only the first head's output is used
C) The outputs are averaged without projection
D) The outputs are discarded
Answer: A
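The concatenate-and-project step can be sketched in isolation. Here the per-head outputs are faked with random tensors just to show the shapes (all names and dimensions are made up for the example):

```python
import torch
import torch.nn as nn

batch, seq, embed_dim, num_heads = 2, 5, 16, 4
head_dim = embed_dim // num_heads  # each head works in a 4-dim subspace

# Pretend each head already produced its attention output:
head_outputs = [torch.randn(batch, seq, head_dim) for _ in range(num_heads)]

concat = torch.cat(head_outputs, dim=-1)    # (2, 5, 16): heads side by side
out_proj = nn.Linear(embed_dim, embed_dim)  # the final output projection
output = out_proj(concat)                   # mix heads, keep original size
```

The final linear layer lets the model blend information across heads rather than leaving each head's slice isolated.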
Which of these is NOT a component of multi-head attention?
A) Query
B) Value
C) Bias
D) Key
Answer: C
Explain how multi-head attention works and why it is useful in simple terms.
Think about how looking at something from different angles helps understand it better.
Describe the role of Query, Key, and Value in the attention mechanism.
Imagine searching for a book in a library using a catalog.