Self-attention and multi-head attention in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
What does self-attention compute in a transformer?
In a transformer model, self-attention helps the model focus on different parts of the input sequence. What exactly does self-attention compute?
A. It applies a convolution operation over the input sequence to extract local features.
B. It computes a weighted sum of the input elements where weights depend on the similarity between elements.
C. It computes the average of all input tokens without considering their relationships.
D. It sorts the input tokens based on their frequency in the sequence.
💡 Hint
Think about how the model decides which words to pay attention to when processing a sentence.
Predict Output
intermediate
Output shape of multi-head attention layer
Given the following PyTorch code snippet for a multi-head attention layer, what is the shape of the output tensor?
import torch
import torch.nn as nn

batch_size = 2
seq_len = 5
embed_dim = 16
num_heads = 4

mha = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads)
query = torch.rand(seq_len, batch_size, embed_dim)
key = torch.rand(seq_len, batch_size, embed_dim)
value = torch.rand(seq_len, batch_size, embed_dim)

output, _ = mha(query, key, value)
print(output.shape)
A. (5, 2, 16)
B. (2, 16, 5)
C. (5, 16, 2)
D. (2, 5, 16)
💡 Hint
Check the expected input and output shapes for PyTorch's MultiheadAttention module.
Hyperparameter
advanced
Effect of increasing number of heads in multi-head attention
What is the main effect of increasing the number of heads in a multi-head attention mechanism while keeping the total embedding dimension fixed?
A. The attention weights become uniform across all tokens.
B. The total embedding dimension increases, making the model slower but more accurate.
C. The model ignores positional information and treats all tokens equally.
D. Each head has a smaller dimension, allowing the model to focus on different representation subspaces.
💡 Hint
Think about how splitting embedding dimension among heads affects representation.
Metrics
advanced
Interpreting attention weights in self-attention
In a trained transformer model, you extract the attention weights from a self-attention layer. What does a high attention weight between two tokens indicate?
A. The tokens appear consecutively in the input sequence.
B. The two tokens are identical in the input sequence.
C. The model considers the two tokens highly related or important to each other for the current task.
D. The tokens have the same part of speech tag.
💡 Hint
Attention weights show how much one token focuses on another during processing.
🔧 Debug
expert
Identifying error in custom multi-head attention implementation
Consider this simplified custom multi-head attention code snippet. What error will it raise when run?
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.q_linear = nn.Linear(embed_dim, embed_dim)
        self.k_linear = nn.Linear(embed_dim, embed_dim)
        self.v_linear = nn.Linear(embed_dim, embed_dim)
        self.out_linear = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        batch_size, seq_len, embed_dim = x.size()
        Q = self.q_linear(x)
        K = self.k_linear(x)
        V = self.v_linear(x)

        # reshape for multi-head
        Q = Q.view(batch_size, seq_len, self.num_heads, self.head_dim)
        K = K.view(batch_size, seq_len, self.num_heads, self.head_dim)
        V = V.view(batch_size, seq_len, self.num_heads, self.head_dim)

        # transpose to get dimensions batch_size, num_heads, seq_len, head_dim
        Q = Q.transpose(1, 2)
        K = K.transpose(1, 2)
        V = V.transpose(1, 2)

        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.head_dim ** 0.5)
        attn = F.softmax(scores, dim=-1)
        out = torch.matmul(attn, V)

        # concatenate heads
        out = out.transpose(1, 2).contiguous().view(batch_size, seq_len, embed_dim)
        out = self.out_linear(out)
        return out

x = torch.rand(2, 5, 16)
model = SimpleMultiHeadAttention(embed_dim=16, num_heads=4)
output = model(x)
A. No error, code runs successfully and output shape is (2, 5, 16)
B. AttributeError because 'SimpleMultiHeadAttention' object has no attribute 'out_linear'
C. RuntimeError due to shape mismatch in torch.matmul(Q, K.transpose(-2, -1))
D. RuntimeError due to incorrect batch size in input tensor
💡 Hint
Check the tensor shapes carefully during reshaping and matrix multiplication.

Practice

1. What is the main purpose of self-attention in natural language processing?
easy
A. To reduce the size of the input data by removing words
B. To generate random sentences without context
C. To translate text from one language to another
D. To let the model focus on important words by comparing all words to each other

Solution

  1. Step 1: Understand self-attention's role

    Self-attention helps the model look at all words in a sentence and decide which ones are important by comparing them to each other.
  2. Step 2: Match purpose with options

    Option D, "To let the model focus on important words by comparing all words to each other," describes this mechanism. Options A, B, and C describe unrelated tasks: removing words, generating random text, and translation.
  3. Final Answer:

    To let the model focus on important words by comparing all words to each other -> Option D
  4. Quick Check:

    Self-attention = focus on important words [OK]
Hint: Self-attention means comparing words to find importance [OK]
Common Mistakes:
  • Confusing self-attention with translation
  • Thinking self-attention removes words
  • Assuming it generates random text
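To make "comparing all words to each other" concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The random embeddings `X` and the projection matrices `W_q`, `W_k`, `W_v` are hypothetical stand-ins for learned parameters; this is an illustration under those assumptions, not a full transformer layer.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Project each word embedding into query, key, and value vectors.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Compare every word with every other word (scaled dot products).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Each row becomes a set of weights that sums to 1.
    weights = softmax(scores, axis=-1)
    # Each output is a weighted sum of the value vectors.
    return weights @ V, weights

# Hypothetical data: 3 words, embedding size 4, random "learned" weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(weights.shape)          # (3, 3): one weight per pair of words
print(weights.sum(axis=-1))   # each row sums to 1
```

Row `i` of `weights` shows how strongly word `i` attends to every word in the sentence, including itself.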
2. Which of the following is the correct way to describe multi-head attention?
easy
A. Running several self-attention processes in parallel to get richer understanding
B. Applying self-attention only once on the input
C. Using attention only on the first word of a sentence
D. Ignoring word relationships and focusing on word order only

Solution

  1. Step 1: Recall multi-head attention definition

    Multi-head attention means running multiple self-attention operations at the same time to capture different aspects of word relationships.
  2. Step 2: Compare options to definition

    Option A, "Running several self-attention processes in parallel to get richer understanding," matches this definition. The other options describe a single attention pass, attention restricted to one word, or ignoring word relationships.
  3. Final Answer:

    Running several self-attention processes in parallel to get richer understanding -> Option A
  4. Quick Check:

    Multi-head attention = multiple self-attentions [OK]
Hint: Multi-head means many self-attentions at once [OK]
Common Mistakes:
  • Thinking multi-head means single attention
  • Believing it focuses only on first word
  • Ignoring word relationships
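The idea of several attention operations running in parallel can be sketched in NumPy as follows. The weight matrices and input are random placeholders (assumptions for illustration), and batching, masking, and biases are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, embed_dim = X.shape
    head_dim = embed_dim // num_heads

    def split_heads(M):
        # (seq_len, embed_dim) -> (num_heads, seq_len, head_dim)
        return M.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    # One attention matrix per head: (num_heads, seq_len, seq_len).
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = softmax(scores, axis=-1)
    heads = weights @ V  # (num_heads, seq_len, head_dim)
    # Concatenate the heads back to (seq_len, embed_dim), then project.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, embed_dim)
    return concat @ W_o, weights

# Hypothetical sizes: 5 tokens, embedding size 16, 4 heads.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
W_q, W_k, W_v, W_o = (rng.normal(size=(16, 16)) for _ in range(4))

output, weights = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads=4)
print(output.shape)   # (5, 16)
print(weights.shape)  # (4, 5, 5): a separate attention pattern per head
```

Each head sees only its own slice of the projected vectors, which is what lets different heads pick up different relationships.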
3. Given the following simplified self-attention scores matrix for a 3-word sentence:
Scores = [[1, 0.5, 0], [0.5, 1, 0.2], [0, 0.2, 1]]
What is the attention weight for the second word attending to the third word after applying softmax on its row?
medium
A. Approximately 0.21
B. Approximately 0.50
C. Approximately 0.29
D. Approximately 0.70

Solution

  1. Step 1: Extract the second row scores

    The second word's scores are [0.5, 1, 0.2].
  2. Step 2: Apply softmax to these scores

    Softmax formula: exp(score) / sum(exp(all scores in the row)). Compute exp(0.5) ≈ 1.65, exp(1) ≈ 2.72, exp(0.2) ≈ 1.22, so the sum is 1.65 + 2.72 + 1.22 = 5.59. The attention weight for the third word is 1.22 / 5.59 ≈ 0.218, which is closest to option A.
  3. Final Answer:

    Approximately 0.21 -> Option A
  4. Quick Check:

    Softmax normalizes scores to probabilities [OK]
Hint: Softmax turns scores into probabilities summing to 1 [OK]
Common Mistakes:
  • Forgetting to exponentiate scores
  • Dividing by wrong sum
  • Mixing row and column values
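The row-wise softmax for this problem can be checked with a few lines of plain Python. The helper name `softmax` is just an illustrative choice; without intermediate rounding the third weight comes out near 0.219.

```python
import math

scores = [[1, 0.5, 0], [0.5, 1, 0.2], [0, 0.2, 1]]

def softmax(row):
    # Exponentiate each score, then normalize by the row total.
    exps = [math.exp(s) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

# Second word (index 1) attending to every word in the sentence.
weights = softmax(scores[1])
print([round(w, 3) for w in weights])  # approximately [0.295, 0.486, 0.219]

# Weight for the second word attending to the third word (index 2).
print(round(weights[2], 3))
```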
4. Consider this Python code snippet for multi-head attention weights calculation:
import numpy as np

def multi_head_attention(scores_list):
    heads = []
    for scores in scores_list:
        weights = np.exp(scores) / np.sum(np.exp(scores))
        heads.append(weights)
    return np.mean(heads, axis=0)

scores_list = [np.array([1, 0, 2]), np.array([0, 1, 1])]
print(multi_head_attention(scores_list))

What is the main bug in this code?
medium
A. Softmax is applied incorrectly; denominator should sum over exp(scores) per head
B. The function returns mean of weights instead of concatenating heads
C. The code uses np.exp twice causing overflow
D. Scores_list should be a 2D array, not a list of arrays

Solution

  1. Step 1: Analyze softmax calculation

    Softmax is correctly applied per head by dividing exp(scores) by sum of exp(scores).
  2. Step 2: Check output aggregation

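    The function averages the softmax weights of the heads element-wise. In standard multi-head attention, each head applies its own weights to its own value vectors, and the resulting head outputs are concatenated (and usually passed through an output projection). Averaging the weights discards what each head computed separately.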
  3. Final Answer:

    The function returns mean of weights instead of concatenating heads -> Option B
  4. Quick Check:

    Multi-head attention combines heads, not averages weights [OK]
Hint: Multi-head attention concatenates heads, not averages weights [OK]
Common Mistakes:
  • Thinking averaging weights is correct
  • Confusing softmax denominator
  • Assuming input format is wrong
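One possible fix is sketched below, assuming each head also has its own value vectors. The `values_list` argument and the name `combine_heads` are hypothetical additions that are not part of the original snippet, and the final output projection is left out for brevity.

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over a 1-D array of scores.
    exps = np.exp(scores - np.max(scores))
    return exps / np.sum(exps)

def combine_heads(scores_list, values_list):
    head_outputs = []
    for scores, values in zip(scores_list, values_list):
        weights = softmax(scores)               # shape: (seq_len,)
        head_outputs.append(weights @ values)   # shape: (head_dim,)
    # Concatenate per-head outputs instead of averaging the weights.
    return np.concatenate(head_outputs)

scores_list = [np.array([1, 0, 2]), np.array([0, 1, 1])]
# Hypothetical value vectors: 3 tokens, head_dim = 2 for each head.
values_list = [np.ones((3, 2)), 2 * np.ones((3, 2))]

out = combine_heads(scores_list, values_list)
print(out.shape)  # (4,): num_heads * head_dim
print(out)        # [1. 1. 2. 2.]
```

Because each head's weights sum to 1 and the placeholder values are constant within a head, the output simply reproduces those constants, which makes the concatenation easy to see.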
5. You want to improve a Transformer model's ability to understand complex sentences by increasing the number of attention heads from 4 to 8. What is the most likely effect of this change?
hard
A. The model will ignore word order completely
B. The model will run faster but lose accuracy
C. The model can capture more diverse word relationships but may require more computation
D. The model will only focus on the first half of the sentence

Solution

  1. Step 1: Understand effect of increasing attention heads

    More heads mean the model can look at different parts of the sentence simultaneously, capturing richer relationships.
  2. Step 2: Consider computational cost and accuracy

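    If the embedding dimension is held fixed, each head works in a smaller subspace, so the projection cost stays roughly the same, but the model computes and stores one attention matrix per head. More heads can therefore add memory and compute overhead, and they may improve understanding and accuracy, though that is not guaranteed.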
  3. Final Answer:

    The model can capture more diverse word relationships but may require more computation -> Option C
  4. Quick Check:

    More heads = richer focus + more compute [OK]
Hint: More heads = more diverse relationships, possibly more compute [OK]
Common Mistakes:
  • Assuming more heads always make model faster
  • Thinking word order is ignored
  • Believing model focuses only on part of sentence
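The trade-off can be made concrete with some simple arithmetic. This sketch assumes the total embedding dimension stays fixed when the head count changes; the sizes (`embed_dim=512`, `seq_len=128`) and the helper `head_stats` are hypothetical choices for illustration.

```python
def head_stats(embed_dim, num_heads, seq_len):
    # The embedding dimension must split evenly across heads.
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    head_dim = embed_dim // num_heads
    # One (seq_len x seq_len) attention matrix is computed per head.
    attn_entries = num_heads * seq_len * seq_len
    return head_dim, attn_entries

for num_heads in (4, 8):
    head_dim, attn_entries = head_stats(embed_dim=512, num_heads=num_heads, seq_len=128)
    print(num_heads, head_dim, attn_entries)
# 4 heads -> head_dim 128, 65536 attention entries
# 8 heads -> head_dim 64, 131072 attention entries
```

Doubling the heads halves the per-head dimension and doubles the number of attention weights to compute and store, which is where the extra cost tends to come from.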