Self-attention helps a model focus on important parts of a sentence when understanding language. Multi-head attention lets the model look at the sentence from different views at the same time.
Self-attention and multi-head attention in NLP
Attention(Q, K, V) = softmax((Q * K^T) / sqrt(d_k)) * V
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W_O
where head_i = Attention(Q * W_Qi, K * W_Ki, V * W_Vi)
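Q, K, V stand for the Query, Key, and Value matrices derived from the input. d_k is the dimension of each key vector, h is the number of heads, and W_Qi, W_Ki, W_Vi, and W_O are learned projection matrices.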
Multi-head attention runs several attention calculations in parallel, then combines their results.
Q = input_embeddings
K = input_embeddings
V = input_embeddings
output = Attention(Q, K, V)
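The pseudocode above can be sketched in plain Python to show each step of the formula. This is a minimal illustration, not a production implementation: the helper names (`softmax`, `matmul`, `transpose`, `attention`) and the toy embedding values are assumptions made for this example, and it uses only the standard library so it runs without NumPy or PyTorch.

```python
import math

def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def attention(q, k, v):
    """softmax((Q * K^T) / sqrt(d_k)) * V, returning (output, weights)."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]
    return matmul(weights, v), weights

# Toy input: 3 tokens with 4-dimensional embeddings (made-up values).
input_embeddings = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]

# Self-attention: Q, K, and V all come from the same input.
output, weights = attention(input_embeddings, input_embeddings, input_embeddings)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row holds one token's attention weights over all three tokens, so every row sums to 1.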
head_1 = Attention(Q * W_Q1, K * W_K1, V * W_V1)
head_2 = Attention(Q * W_Q2, K * W_K2, V * W_V2)
output = Concat(head_1, head_2) * W_O
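The two-head pseudocode above can be made concrete with a small standard-library sketch. The weight matrices here are hypothetical and chosen only for readability: head 1 looks at the first two embedding dimensions, head 2 at the last two, and W_O is the identity. In a real model all of these would be learned.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """softmax((Q * K^T) / sqrt(d_k)) * V"""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head(x, head_weights, w_o):
    """Concat(head_1, ..., head_h) * W_O for self-attention (Q = K = V = x)."""
    heads = [
        attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        for w_q, w_k, w_v in head_weights
    ]
    # Concatenate along the feature axis: one combined row per token.
    concat = [sum((head[i] for head in heads), []) for i in range(len(x))]
    return matmul(concat, w_o)

# Hypothetical weights: head 1 reads dimensions 0-1, head 2 reads dimensions 2-3.
pick_first = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
pick_last = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
head_weights = [(pick_first,) * 3, (pick_last,) * 3]
w_o = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # identity

x = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
output = multi_head(x, head_weights, w_o)
print(len(output), len(output[0]))  # 3 tokens, 4 features each
```

Each head produces a 3 x 2 result, and concatenating the two heads restores the original width of 4 before W_O is applied.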
This code creates a simple self-attention layer with two heads. It takes a small input tensor and computes the self-attention output.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        assert (
            self.head_dim * heads == embed_size
        ), "Embedding size needs to be divisible by heads"

        self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.fc_out = nn.Linear(heads * self.head_dim, embed_size)

    def forward(self, values, keys, queries):
        N = queries.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], queries.shape[1]

        # Split embedding into self.heads pieces
        values = values.reshape(N, value_len, self.heads, self.head_dim)
        keys = keys.reshape(N, key_len, self.heads, self.head_dim)
        queries = queries.reshape(N, query_len, self.heads, self.head_dim)

        values = self.values(values)
        keys = self.keys(keys)
        queries = self.queries(queries)

        # Einsum does batch matrix multiplication for query*keys for each training example
        energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])

        # Scale energy
        energy = energy / (self.head_dim ** 0.5)

        attention = torch.softmax(energy, dim=3)

        out = torch.einsum("nhql,nlhd->nqhd", [attention, values]).reshape(
            N, query_len, self.heads * self.head_dim
        )

        out = self.fc_out(out)
        return out


# Example usage
embed_size = 8
heads = 2
self_attention = SelfAttention(embed_size, heads)

# Batch size 1, sequence length 3, embedding size 8
x = torch.tensor([[[1., 0., 1., 0., 1., 0., 1., 0.],
                   [0., 1., 0., 1., 0., 1., 0., 1.],
                   [1., 1., 1., 1., 1., 1., 1., 1.]]])

output = self_attention(x, x, x)
print(output)
Self-attention helps the model understand relationships between words regardless of how far apart they are in the sentence.
Multi-head attention allows the model to capture different types of relationships at once.
Embedding size must be divisible by the number of heads for splitting.
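The divisibility requirement is easiest to see on a single embedding vector. The sketch below uses a hypothetical helper, `split_heads`, that is not part of the layer above; it only illustrates why an uneven split cannot work.

```python
def split_heads(vector, heads):
    """Split one embedding vector into `heads` equal-sized chunks."""
    if len(vector) % heads != 0:
        raise ValueError("Embedding size needs to be divisible by heads")
    head_dim = len(vector) // heads
    return [vector[i * head_dim:(i + 1) * head_dim] for i in range(heads)]

embedding = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]  # embed_size = 8

print(split_heads(embedding, 2))  # two chunks of 4 values each

try:
    split_heads(embedding, 3)  # 8 is not divisible by 3
except ValueError as err:
    print(err)
```

With 8 dimensions, 2 or 4 heads split cleanly, while 3 heads would leave dimensions that belong to no head.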
Self-attention lets a model focus on important words in a sentence by comparing all words to each other.
Multi-head attention runs several self-attention processes in parallel to get richer understanding.
This technique is central to the Transformer architecture used by modern language models.
Practice
Solution
Step 1: Understand self-attention's role
Self-attention helps the model look at all words in a sentence and decide which ones are important by comparing them to each other.
Step 2: Match purpose with options
"To let the model focus on important words by comparing all words to each other" correctly describes this focus mechanism, while the others describe unrelated tasks.
Final Answer:
To let the model focus on important words by comparing all words to each other -> Option D
Quick Check:
Self-attention = focus on important words [OK]
- Confusing self-attention with translation
- Thinking self-attention removes words
- Assuming it generates random text
Solution
Step 1: Recall multi-head attention definition
Multi-head attention means running multiple self-attention operations at the same time to capture different aspects of word relationships.
Step 2: Compare options to definition
"Running several self-attention processes in parallel to get richer understanding" matches this exactly; the others describe incomplete or incorrect ideas.
Final Answer:
Running several self-attention processes in parallel to get richer understanding -> Option A
Quick Check:
Multi-head attention = multiple self-attentions [OK]
- Thinking multi-head means single attention
- Believing it focuses only on first word
- Ignoring word relationships
Scores = [[1, 0.5, 0], [0.5, 1, 0.2], [0, 0.2, 1]]
What is the attention weight for the second word attending to the third word after applying softmax on its row?
Solution
Step 1: Extract the second row scores
The second word's scores are [0.5, 1, 0.2].
Step 2: Apply softmax to these scores
Softmax formula: exp(score) / sum(exp(all scores)). Calculate exp(0.5) ≈ 1.65, exp(1) ≈ 2.72, exp(0.2) ≈ 1.22. Sum = 1.65 + 2.72 + 1.22 = 5.59. Attention weight for the third word = 1.22 / 5.59 ≈ 0.218 (about 0.2186 with unrounded exponentials).
Final Answer:
Approximately 0.21 -> Option A
Quick Check:
Softmax normalizes scores to probabilities [OK]
- Forgetting to exponentiate scores
- Dividing by wrong sum
- Mixing row and column values
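The hand calculation for the second row can be checked with a few lines of standard-library Python. This is only a verification sketch; the variable names are chosen for this example.

```python
import math

scores = [0.5, 1.0, 0.2]  # second row of the score matrix

# Softmax: exponentiate every score, then divide by the sum of the exponentials.
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]

print([round(w, 3) for w in weights])  # [0.295, 0.486, 0.219]
```

The third weight is the one the question asks for. Using unrounded exponentials gives a slightly larger value than the hand calculation, which rounds each intermediate result to two decimals.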
import numpy as np

def multi_head_attention(scores_list):
    heads = []
    for scores in scores_list:
        weights = np.exp(scores) / np.sum(np.exp(scores))
        heads.append(weights)
    return np.mean(heads, axis=0)

scores_list = [np.array([1, 0, 2]), np.array([0, 1, 1])]
print(multi_head_attention(scores_list))

What is the main bug in this code?
Solution
Step 1: Analyze softmax calculation
Softmax is correctly applied per head by dividing exp(scores) by the sum of exp(scores).
Step 2: Check output aggregation
The function averages the weights from each head, but multi-head attention concatenates the head outputs (and then applies W_O) rather than averaging them element-wise.
Final Answer:
The function returns mean of weights instead of concatenating heads -> Option B
Quick Check:
Multi-head attention combines heads, not averages weights [OK]
- Thinking averaging weights is correct
- Confusing softmax denominator
- Assuming input format is wrong
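One possible fix for the aggregation step is sketched below. It is written with plain Python lists instead of NumPy so it runs without extra dependencies, and it only addresses the bug the question targets: a full multi-head layer would also multiply each head's weights by V and apply W_O after concatenation.

```python
import math

def softmax(scores):
    """Exponentiate the scores and normalize them to sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multi_head_attention(scores_list):
    """Softmax each head's scores, then concatenate instead of averaging."""
    combined = []
    for scores in scores_list:
        combined.extend(softmax(scores))  # keep every head's weights
    return combined

scores_list = [[1, 0, 2], [0, 1, 1]]
result = multi_head_attention(scores_list)
print(len(result))  # 6 values: 3 from each head
```

Concatenation keeps each head's result intact, whereas the element-wise mean blends the heads and discards what distinguishes them.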
Solution
Step 1: Understand effect of increasing attention heads
More heads mean the model can look at different parts of the sentence simultaneously, capturing richer relationships.
Step 2: Consider computational cost and accuracy
Increasing heads usually increases computation and memory needs, since each head produces its own attention map, but it can improve understanding and accuracy.
Final Answer:
The model can capture more diverse word relationships but may require more computation -> Option C
Quick Check:
More heads = richer focus + more compute [OK]
- Assuming more heads always make model faster
- Thinking word order is ignored
- Believing model focuses only on part of sentence
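The trade-off can be made concrete with a rough count. The sketch below assumes the common setup used by the SelfAttention class earlier, where the embedding is split evenly so that head_dim = embed_size // heads. The helper `head_stats` and the sizes 512 and 128 are hypothetical, chosen only for illustration, and the count covers attention-score entries only, not total compute.

```python
def head_stats(embed_size, seq_len, heads):
    """Per-head width and total attention-score entries for one sequence."""
    if embed_size % heads != 0:
        raise ValueError("Embedding size needs to be divisible by heads")
    head_dim = embed_size // heads
    # Each head stores one seq_len x seq_len attention map.
    score_entries = heads * seq_len * seq_len
    return head_dim, score_entries

# Hypothetical sizes chosen only for illustration.
for heads in (1, 2, 4, 8):
    head_dim, score_entries = head_stats(embed_size=512, seq_len=128, heads=heads)
    print(heads, head_dim, score_entries)
```

Under this assumption, adding heads multiplies the number of attention maps while each head works with a narrower slice of the embedding, which is one reason more heads do not automatically mean better results.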
