NLP · ML · ~20 mins

Self-attention and multi-head attention in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
🎖️ Self-Attention Master
Get all five challenges correct to earn this badge. Test your skills under time pressure!
🧠 Conceptual · intermediate
What does self-attention compute in a transformer?
In a transformer model, self-attention helps the model focus on different parts of the input sequence. What exactly does self-attention compute?
A. It applies a convolution operation over the input sequence to extract local features.
B. It computes a weighted sum of the input elements where weights depend on the similarity between elements.
C. It computes the average of all input tokens without considering their relationships.
D. It sorts the input tokens based on their frequency in the sequence.
💡 Hint
Think about how the model decides which words to pay attention to when processing a sentence.
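To see this concretely, here is a minimal sketch (not part of the challenge) of scaled dot-product self-attention on a single sequence of random embeddings; the tensor sizes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d = 4, 8
x = torch.rand(seq_len, d)  # one sequence of token embeddings

# similarity between every pair of tokens (dot products), scaled
scores = x @ x.T / d ** 0.5          # shape (seq_len, seq_len)
weights = F.softmax(scores, dim=-1)  # each row is a distribution over tokens
out = weights @ x                    # weighted sum of the input elements

print(weights.sum(dim=-1))  # every row sums to 1
print(out.shape)            # torch.Size([4, 8])
```

Each output row is a weighted sum of all input rows, with the weights determined by pairwise dot-product similarity between elements.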
Predict Output · intermediate
Output shape of multi-head attention layer
Given the following PyTorch code snippet for a multi-head attention layer, what is the shape of the output tensor?
import torch
import torch.nn as nn

batch_size = 2
seq_len = 5
embed_dim = 16
num_heads = 4

mha = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads)
query = torch.rand(seq_len, batch_size, embed_dim)
key = torch.rand(seq_len, batch_size, embed_dim)
value = torch.rand(seq_len, batch_size, embed_dim)

output, _ = mha(query, key, value)
print(output.shape)
A. (5, 2, 16)
B. (2, 16, 5)
C. (5, 16, 2)
D. (2, 5, 16)
💡 Hint
Check the expected input and output shapes for PyTorch's MultiheadAttention module.
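If you want to verify your prediction after answering, the snippet reduces to a self-contained check. Note that `nn.MultiheadAttention` defaults to `batch_first=False`, so inputs and outputs use the `(seq_len, batch, embed_dim)` layout:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
mha = nn.MultiheadAttention(embed_dim=16, num_heads=4)  # batch_first=False by default
x = torch.rand(5, 2, 16)                                # (seq_len, batch, embed_dim)
out, _ = mha(x, x, x)                                   # self-attention: query = key = value
print(out.shape)                                        # same layout as the query input
```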
Hyperparameter · advanced
Effect of increasing number of heads in multi-head attention
What is the main effect of increasing the number of heads in a multi-head attention mechanism while keeping the total embedding dimension fixed?
A. The attention weights become uniform across all tokens.
B. The total embedding dimension increases, making the model slower but more accurate.
C. The model ignores positional information and treats all tokens equally.
D. Each head has a smaller dimension, allowing the model to focus on different representation subspaces.
💡 Hint
Think about how splitting embedding dimension among heads affects representation.
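As a quick arithmetic check (a sketch, not part of the challenge): with the total embedding dimension held fixed, the per-head dimension is `embed_dim // num_heads`, so adding heads shrinks each head's subspace:

```python
# With the total embedding dimension fixed, each head works
# in a smaller subspace: head_dim = embed_dim // num_heads.
embed_dim = 16
for num_heads in (1, 2, 4, 8):
    head_dim = embed_dim // num_heads
    print(f"{num_heads} heads -> head_dim {head_dim}, total {num_heads * head_dim}")
```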
Metrics · advanced
Interpreting attention weights in self-attention
In a trained transformer model, you extract the attention weights from a self-attention layer. What does a high attention weight between two tokens indicate?
A. The tokens appear consecutively in the input sequence.
B. The two tokens are identical in the input sequence.
C. The model considers the two tokens highly related or important to each other for the current task.
D. The tokens have the same part of speech tag.
💡 Hint
Attention weights show how much one token focuses on another during processing.
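As a hands-on illustration (a sketch with illustrative sizes, not part of the challenge), PyTorch's `MultiheadAttention` can return the attention weights directly; each row is a softmax distribution over source tokens showing where a given token attends:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
x = torch.rand(2, 5, 16)  # (batch, seq_len, embed_dim)

# need_weights=True returns weights averaged over heads: (batch, tgt_len, src_len)
out, attn_weights = mha(x, x, x, need_weights=True)
print(attn_weights.shape)           # torch.Size([2, 5, 5])
print(attn_weights[0].sum(dim=-1))  # each row sums to 1
```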
🔧 Debug · expert
Identifying error in custom multi-head attention implementation
Consider this simplified custom multi-head attention code snippet. What error will it raise when run?
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.q_linear = nn.Linear(embed_dim, embed_dim)
        self.k_linear = nn.Linear(embed_dim, embed_dim)
        self.v_linear = nn.Linear(embed_dim, embed_dim)
        self.out_linear = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        batch_size, seq_len, embed_dim = x.size()
        Q = self.q_linear(x)
        K = self.k_linear(x)
        V = self.v_linear(x)

        # reshape for multi-head
        Q = Q.view(batch_size, seq_len, self.num_heads, self.head_dim)
        K = K.view(batch_size, seq_len, self.num_heads, self.head_dim)
        V = V.view(batch_size, seq_len, self.num_heads, self.head_dim)

        # transpose to get dimensions batch_size, num_heads, seq_len, head_dim
        Q = Q.transpose(1, 2)
        K = K.transpose(1, 2)
        V = V.transpose(1, 2)

        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.head_dim ** 0.5)
        attn = F.softmax(scores, dim=-1)
        out = torch.matmul(attn, V)

        # concatenate heads
        out = out.transpose(1, 2).contiguous().view(batch_size, seq_len, embed_dim)
        out = self.out_linear(out)
        return out

x = torch.rand(2, 5, 16)
model = SimpleMultiHeadAttention(embed_dim=16, num_heads=4)
output = model(x)
A. No error, code runs successfully and output shape is (2, 5, 16)
B. AttributeError because 'SimpleMultiHeadAttention' object has no attribute 'out_linear'
C. RuntimeError due to shape mismatch in torch.matmul(Q, K.transpose(-2, -1))
D. RuntimeError due to incorrect batch size in input tensor
💡 Hint
Check the tensor shapes carefully during reshaping and matrix multiplication.
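One way to work through this is to trace the shapes of the suspicious steps with dummy data. The standalone sketch below (using the same sizes as the snippet) reproduces just the reshape/transpose/matmul sequence:

```python
import torch

batch_size, seq_len, embed_dim, num_heads = 2, 5, 16, 4
head_dim = embed_dim // num_heads  # 4

# mimic the reshape/transpose sequence from the snippet
Q = torch.rand(batch_size, seq_len, embed_dim)
Q = Q.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)  # (2, 4, 5, 4)
K = torch.rand(batch_size, seq_len, embed_dim)
K = K.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)  # (2, 4, 5, 4)

# (2, 4, 5, 4) @ (2, 4, 4, 5) -> the matmul is shape-compatible
scores = torch.matmul(Q, K.transpose(-2, -1)) / head_dim ** 0.5
print(scores.shape)  # torch.Size([2, 4, 5, 5])
```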