PyTorch · ~20 mins

Multi-head attention in PyTorch - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output (intermediate)
Output shape of multi-head attention
Given the following PyTorch code snippet using multi-head attention, what is the shape of the output tensor?
PyTorch
import torch
import torch.nn as nn

batch_size = 2
seq_len = 5
embed_dim = 16
num_heads = 4

mha = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads)
query = torch.rand(seq_len, batch_size, embed_dim)
key = torch.rand(seq_len, batch_size, embed_dim)
value = torch.rand(seq_len, batch_size, embed_dim)
out, _ = mha(query, key, value)
print(out.shape)
A. torch.Size([5, 2, 16])
B. torch.Size([2, 5, 16])
C. torch.Size([5, 16, 2])
D. torch.Size([2, 16, 5])
💡 Hint
Remember that PyTorch's MultiheadAttention expects input shape (sequence_length, batch_size, embedding_dim) and outputs the same shape.
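A quick way to verify the layout convention yourself: by default `nn.MultiheadAttention` is sequence-first, and passing `batch_first=True` switches to `(batch, seq, embed)`. A minimal sketch (shapes chosen to match the problem, not part of the original snippet):

```python
import torch
import torch.nn as nn

# Default layout: (seq_len, batch_size, embed_dim)
mha = nn.MultiheadAttention(embed_dim=16, num_heads=4)
x = torch.rand(5, 2, 16)
out, _ = mha(x, x, x)
# Output keeps the same layout as the input
assert out.shape == x.shape

# batch_first=True switches the layout to (batch_size, seq_len, embed_dim)
mha_bf = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
xb = torch.rand(2, 5, 16)
out_b, _ = mha_bf(xb, xb, xb)
assert out_b.shape == xb.shape
```

In both cases the output tensor has exactly the shape of the query tensor, which is the key to the question above.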
Model Choice (intermediate)
Choosing number of heads in multi-head attention
You have an embedding dimension of 64 for your model. Which choice of number of heads is valid for PyTorch's MultiheadAttention module?
A. 10
B. 7
C. 9
D. 8
💡 Hint
The embedding dimension must be divisible by the number of heads.
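The divisibility rule from the hint can be checked mechanically before constructing the module. A small sketch (the candidate head counts mirror the options above):

```python
embed_dim = 64

# nn.MultiheadAttention requires embed_dim % num_heads == 0;
# collect the head counts that satisfy it.
valid = [h for h in (7, 8, 9, 10) if embed_dim % h == 0]
print(valid)
```

Passing an invalid head count to `nn.MultiheadAttention` raises an error at construction time, so this check is worth doing before training starts.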
Hyperparameter (advanced)
Effect of increasing number of heads on model capacity
What is the most accurate effect of increasing the number of heads in a multi-head attention layer while keeping the embedding dimension fixed?
A. It increases the model's ability to focus on different parts of the input simultaneously without increasing the total embedding size.
B. It increases the embedding dimension, making the model larger and slower.
C. It decreases the model's ability to learn because each head gets fewer parameters.
D. It reduces the number of parameters by sharing weights across heads.
💡 Hint
Think about how multi-head attention splits the embedding dimension into multiple heads.
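The hint can be made concrete by counting parameters: each head operates on `embed_dim // num_heads` dimensions, so adding heads narrows each head rather than growing the model. A sketch (parameter counting is an illustration, not part of the original problem):

```python
import torch.nn as nn

embed_dim = 64

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# More heads => narrower heads; the total projection size,
# and hence the parameter count, stays the same.
param_counts = []
for num_heads in (2, 4, 8):
    mha = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads)
    print(num_heads, "head_dim =", embed_dim // num_heads,
          "params =", n_params(mha))
    param_counts.append(n_params(mha))
```

All three configurations report the same parameter count; only the per-head dimension changes.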
🔧 Debug (advanced)
Identifying error in multi-head attention input shapes
What error will this PyTorch code raise when running the multi-head attention layer?
PyTorch
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=32, num_heads=4)
query = torch.rand(10, 3, 32)
key = torch.rand(10, 3, 16)
value = torch.rand(10, 3, 32)
out, _ = mha(query, key, value)
A. TypeError: Expected 3D tensor for query
B. RuntimeError: The key and query must have the same embedding dimension
C. ValueError: Number of heads must divide embedding dimension
D. No error, code runs successfully
💡 Hint
Check the embedding dimension of key and query tensors.
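For reference, `nn.MultiheadAttention` *can* accept key/value tensors with a different feature size, but only if that is declared up front via the `kdim`/`vdim` constructor arguments. A sketch of the corrected version of the snippet above:

```python
import torch
import torch.nn as nn

# Declaring kdim=16 tells the module to project 16-dim keys
# up to the 32-dim attention space; the output keeps embed_dim features.
mha = nn.MultiheadAttention(embed_dim=32, num_heads=4, kdim=16, vdim=32)
query = torch.rand(10, 3, 32)
key = torch.rand(10, 3, 16)
value = torch.rand(10, 3, 32)
out, _ = mha(query, key, value)
print(out.shape)  # torch.Size([10, 3, 32])
```

Without `kdim`, the module assumes keys share the query's embedding dimension, which is why the original snippet fails.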
🧠 Conceptual (expert)
Why use multi-head attention instead of single-head attention?
Which of the following best explains the main advantage of multi-head attention over single-head attention?
A. Multi-head attention reduces the computational cost compared to single-head attention.
B. Multi-head attention always produces smaller output tensors than single-head attention.
C. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
D. Multi-head attention eliminates the need for positional encoding in sequence models.
💡 Hint
Think about how splitting into multiple heads affects the model's focus.
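One way to see the heads at work is to inspect the per-head attention maps. A sketch, assuming a PyTorch version where `forward` accepts `average_attn_weights` (added in 1.11):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
x = torch.rand(2, 5, 16)

# average_attn_weights=False returns one attention map per head
# instead of averaging them, so each head's focus is visible.
_, weights = mha(x, x, x, need_weights=True, average_attn_weights=False)
print(weights.shape)  # (batch, num_heads, tgt_len, src_len)
```

Each of the four maps is a separate distribution over positions, which is exactly the "different representation subspaces at different positions" idea.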