Recall & Review
beginner
What is the main idea behind multi-head attention?
Multi-head attention splits the attention mechanism into several smaller parts (heads) that run in parallel. Each head learns different relationships in the data, and their results are combined to capture richer information.
beginner
In multi-head attention, what are the Query, Key, and Value?
Query, Key, and Value are three sets of vectors derived from the input data. The Query is what we want to find information about, the Key helps match the Query, and the Value holds the actual information to be gathered.
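To make this concrete, here is a minimal sketch of how Q, K, and V are typically derived from the same input through learned linear maps. The projection layers `w_q`, `w_k`, and `w_v` are hypothetical names for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim = 8
x = torch.randn(3, embed_dim)  # 3 tokens, each an 8-dim embedding

# Hypothetical projections: each derives one of Q, K, V from the same input.
w_q = nn.Linear(embed_dim, embed_dim, bias=False)
w_k = nn.Linear(embed_dim, embed_dim, bias=False)
w_v = nn.Linear(embed_dim, embed_dim, bias=False)

Q, K, V = w_q(x), w_k(x), w_v(x)
print(Q.shape, K.shape, V.shape)  # each is (3, 8)
```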
intermediate
Why do we use multiple heads instead of one in attention?
Using multiple heads allows the model to focus on different parts or aspects of the input simultaneously. This helps the model understand complex patterns better than a single attention head.
intermediate
What is the shape of the output from a multi-head attention layer in PyTorch?
With the default batch_first=False, the output shape is (sequence_length, batch_size, embedding_dim); with batch_first=True the first two dimensions swap to (batch_size, sequence_length, embedding_dim). In both cases embedding_dim matches the original input dimension after all heads are combined.
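A quick check of this shape with torch.nn.MultiheadAttention, using self-attention on a random tensor (the sizes here are arbitrary):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=16, num_heads=4)  # batch_first=False by default
x = torch.randn(5, 2, 16)  # (sequence_length, batch_size, embedding_dim)

# Self-attention: the same tensor serves as query, key, and value.
out, attn_weights = mha(x, x, x)
print(out.shape)           # torch.Size([5, 2, 16]) — same shape as the input
print(attn_weights.shape)  # torch.Size([2, 5, 5]) — weights averaged over heads
```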
intermediate
How is the scaled dot-product attention computed inside each head?
Scaled dot-product attention is computed by taking the dot product of Query and Key, dividing by the square root of the key dimension to scale, applying softmax to get weights, and then multiplying by the Value vectors.
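The steps above can be sketched directly. The function name `scaled_dot_product_attention` and the small random tensors are our own choices for illustration:

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # scaled dot products
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V                                  # weighted sum of values

torch.manual_seed(0)
Q = torch.randn(3, 4)
K = torch.randn(3, 4)
V = torch.randn(3, 4)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # torch.Size([3, 4])
```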
What does each head in multi-head attention learn?
Each head focuses on different parts or aspects of the input to capture diverse information.
What is the purpose of scaling the dot product in scaled dot-product attention?
Without scaling, the variance of the dot products grows with the key dimension, which pushes softmax into a saturated, nearly one-hot regime where gradients are vanishingly small. Dividing by the square root of the key dimension keeps the dot products in a well-behaved range.
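A small illustration of why the scaling matters (the sizes are arbitrary): dot products of high-dimensional random vectors have large variance, so the unscaled softmax tends toward a near-one-hot distribution, while the scaled version stays softer:

```python
import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(d_k)
k = torch.randn(10, d_k)

raw = k @ q                  # dot products; variance grows with d_k
scaled = raw / d_k ** 0.5    # variance brought back to roughly 1

print(torch.softmax(raw, dim=0).max())     # typically near 1: almost one-hot
print(torch.softmax(scaled, dim=0).max())  # a softer distribution
```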
In PyTorch, which class implements multi-head attention?
PyTorch provides torch.nn.MultiheadAttention for multi-head attention implementation.
What is combined after all attention heads compute their outputs?
Outputs from all heads are concatenated and then passed through a linear layer to match the original embedding size.
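A minimal sketch of this combine step, assuming four hypothetical per-head outputs of 8 dimensions each:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_heads, head_dim = 4, 8
embed_dim = num_heads * head_dim  # 32

# Stand-ins for per-head outputs: each is (sequence_length, head_dim).
head_outputs = [torch.randn(5, head_dim) for _ in range(num_heads)]

concat = torch.cat(head_outputs, dim=-1)    # (5, 32): heads placed side by side
out_proj = nn.Linear(embed_dim, embed_dim)  # final linear layer mixes the heads
out = out_proj(concat)
print(out.shape)  # torch.Size([5, 32]) — back to the original embedding size
```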
Which of these is NOT a component of multi-head attention?
Query, Key, and Value are core components; bias is not a fundamental part of multi-head attention.
Explain how multi-head attention works and why it is useful in simple terms.
Think about how looking at something from different angles helps understand it better.
Describe the role of Query, Key, and Value in the attention mechanism.
Imagine searching for a book in a library using a catalog.