Challenge - 5 Problems
Self-Attention Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00
Output of scaled dot-product attention calculation
Given the following PyTorch code snippet implementing scaled dot-product attention, what is the output tensor?
PyTorch
import torch
import torch.nn.functional as F

query = torch.tensor([[1., 0., 0.]])              # shape (1, 3)
key = torch.tensor([[1., 0., 0.], [0., 1., 0.]])  # shape (2, 3)
value = torch.tensor([[1., 2.], [3., 4.]])        # shape (2, 2)

# Compute attention scores
scores = torch.matmul(query, key.T) / (3 ** 0.5)
weights = F.softmax(scores, dim=-1)
output = torch.matmul(weights, value)
print(output)
💡 Hint
Recall that softmax normalizes scores to probabilities, and output is weighted sum of values.
Explanation
The query matches the first key exactly, so the first key gets the larger attention weight, but not a weight near 1: after scaling, the scores are [1/√3, 0] ≈ [0.577, 0], and softmax turns these into weights of roughly [0.640, 0.360]. The output is the corresponding weighted sum of the value vectors, 0.640·[1., 2.] + 0.360·[3., 4.] ≈ [[1.719, 2.719]].
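The arithmetic behind those weights can be checked by hand; a minimal sketch reproducing the softmax step with plain Python:

```python
import math

# Scaled scores: query·key_i / sqrt(d) with d = 3
s1 = 1.0 / math.sqrt(3)   # query matches key 1 exactly -> dot product 1
s2 = 0.0 / math.sqrt(3)   # query is orthogonal to key 2 -> dot product 0

# Softmax over the two scores
z = math.exp(s1) + math.exp(s2)
w1, w2 = math.exp(s1) / z, math.exp(s2) / z   # ≈ 0.640, 0.360

# Weighted sum of the value vectors [1., 2.] and [3., 4.]
out = [w1 * 1. + w2 * 3., w1 * 2. + w2 * 4.]  # ≈ [1.719, 2.719]
print(w1, w2, out)
```

Note that even a "perfect" query–key match produces a soft weighting here, because the scores differ by only 0.577 before the softmax.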
❓ Model Choice
Intermediate · 1:30
Choosing the correct self-attention output shape
In a self-attention layer, if the input tensor has shape (batch_size=4, seq_len=10, embedding_dim=64), what will be the shape of the output tensor after applying multi-head self-attention with 8 heads and the same embedding dimension?
💡 Hint
Multi-head attention splits embedding_dim into heads but concatenates back to original embedding_dim.
Explanation
Multi-head attention splits the embedding dimension into 8 heads of size 8 each (64/8=8), processes attention in parallel, then concatenates back to shape (batch_size, seq_len, embedding_dim), which is (4, 10, 64).
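This can be verified with PyTorch's built-in layer; a minimal sketch using torch.nn.MultiheadAttention (batch_first=True matches the (batch, seq, embed) layout in the question):

```python
import torch
import torch.nn as nn

# Dimensions from the question: batch 4, sequence length 10, embedding 64, 8 heads
mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(4, 10, 64)

# Self-attention: query, key, and value are all the same tensor
out, attn_weights = mha(x, x, x)
print(out.shape)  # torch.Size([4, 10, 64]) — shape is preserved
```

The output shape matches the input shape because the per-head outputs (each of size 64/8 = 8) are concatenated back to the full embedding dimension.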
❓ Hyperparameter
Advanced · 1:30
Effect of increasing number of attention heads
What is the most likely effect of increasing the number of attention heads in a multi-head self-attention model while keeping the total embedding dimension fixed?
💡 Hint
Think about how embedding dimension is split among heads.
Explanation
Increasing the number of heads splits the fixed embedding dimension into smaller parts per head (head_dim = embed_dim / num_heads), allowing each head to attend to different aspects of the input and improving the diversity of attention patterns, at the cost of a lower-dimensional representation within each head.
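The split can be illustrated with a tensor reshape; a minimal sketch, assuming the (4, 10, 64) input from the previous question:

```python
import torch

batch, seq_len, embed_dim = 4, 10, 64
x = torch.randn(batch, seq_len, embed_dim)

for num_heads in (2, 4, 8, 16):
    head_dim = embed_dim // num_heads  # per-head dimension shrinks as heads grow
    # Reshape to (batch, num_heads, seq_len, head_dim) for parallel attention
    heads = x.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)
    print(num_heads, head_dim, tuple(heads.shape))
```

Total parameters and output shape stay the same in every case; only the granularity of each head's subspace changes.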
🔧 Debug
Advanced · 2:00
Identifying error in self-attention implementation
Consider this PyTorch code snippet for self-attention. What error will it raise when executed?
PyTorch
import torch
import torch.nn.functional as F

query = torch.randn(2, 5, 16)
key = torch.randn(2, 5, 16)
value = torch.randn(2, 5, 16)

scores = torch.matmul(query, key) / (16 ** 0.5)
weights = F.softmax(scores, dim=-1)
output = torch.matmul(weights, value)
print(output.shape)
💡 Hint
Check the dimensions of query and key for matrix multiplication.
Explanation
query and key both have shape (2, 5, 16). For a batched matmul, the last dimension of the first tensor (16) must equal the second-to-last dimension of the second (5); since 16 != 5, torch.matmul raises a RuntimeError. The fix is to transpose the last two dimensions of key, e.g. key.transpose(-2, -1).
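A sketch of the corrected snippet, with the key transposed so the inner dimensions line up:

```python
import torch
import torch.nn.functional as F

query = torch.randn(2, 5, 16)
key = torch.randn(2, 5, 16)
value = torch.randn(2, 5, 16)

# Transpose key's last two dims so the inner dimensions match:
# (2, 5, 16) @ (2, 16, 5) -> (2, 5, 5)
scores = torch.matmul(query, key.transpose(-2, -1)) / (16 ** 0.5)
weights = F.softmax(scores, dim=-1)  # one attention distribution per query position
output = torch.matmul(weights, value)
print(output.shape)  # torch.Size([2, 5, 16])
```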
🧠 Conceptual
Expert · 1:30
Why use scaled dot-product in self-attention?
Why do self-attention mechanisms scale the dot product of query and key vectors by the square root of their dimension?
💡 Hint
Think about how large dot products affect softmax.
Explanation
Without scaling, large dot products can push softmax into regions with very small gradients, making training harder. Scaling by sqrt(dim) keeps values in a range that stabilizes gradients.
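The effect is easy to demonstrate: for random vectors of dimension d, dot products have variance roughly d, so unscaled softmax saturates toward a one-hot distribution. A minimal sketch (the dimension 512 and the seed are arbitrary choices):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 512
q = torch.randn(d)       # one query vector
k = torch.randn(10, d)   # ten key vectors

raw = k @ q              # scores with std ~ sqrt(d) ≈ 22.6: softmax saturates
scaled = raw / d ** 0.5  # scores with std ~ 1: softmax stays soft

p_raw = F.softmax(raw, dim=-1)
p_scaled = F.softmax(scaled, dim=-1)
print(p_raw.max().item(), p_scaled.max().item())
```

The unscaled distribution concentrates nearly all its mass on one key, where softmax's gradients vanish; the scaled distribution spreads mass across keys and keeps gradients usable.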