Attention helps a model focus on the important parts of its input when making decisions. It works much like the way we pick out key words in a sentence to understand its meaning.
Attention mechanism basics in NLP
Introduction
Syntax
    attention_scores = query @ key.T / sqrt(d_k)
    attention_weights = softmax(attention_scores)
    output = attention_weights @ value
query, key, and value are vectors or matrices representing parts of the input.
Dividing by sqrt(d_k), the square root of the key dimension, keeps the scores from growing with the vector size, so the softmax does not saturate.
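To illustrate why the division by sqrt(d_k) matters, here is a minimal NumPy sketch. It assumes the query and key entries are independent random values with unit variance, which is a simplification chosen for illustration. The helper name `score_std`, the seed, and the sample sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility


def score_std(d_k, n_samples=10_000, scale=True):
    """Empirical standard deviation of q . k for random unit-variance vectors."""
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    scores = np.sum(q * k, axis=1)       # one dot product per sample
    if scale:
        scores = scores / np.sqrt(d_k)   # the scaling used in the syntax above
    return scores.std()


for d_k in (4, 64, 512):
    unscaled = score_std(d_k, scale=False)
    scaled = score_std(d_k, scale=True)
    print(f"d_k={d_k:4d}  unscaled std={unscaled:6.2f}  scaled std={scaled:5.2f}")
```

Under these assumptions the unscaled spread grows roughly like sqrt(d_k), while the scaled spread stays near 1 regardless of the dimension.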
Examples
    import torch
    import torch.nn.functional as F

    query = torch.tensor([[1., 0., 1.]])
    key = torch.tensor([[1., 0., 0.], [0., 1., 0.]])
    value = torch.tensor([[1., 2.], [3., 4.]])

    scores = query @ key.T / (3 ** 0.5)
    weights = F.softmax(scores, dim=1)
    output = weights @ value
    print(output)
    import numpy as np

    def softmax(x):
        e_x = np.exp(x - np.max(x))
        return e_x / e_x.sum(axis=-1, keepdims=True)

    query = np.array([1, 0, 1])
    key = np.array([[1, 0, 0], [0, 1, 0]])
    value = np.array([[1, 2], [3, 4]])

    scores = query @ key.T / np.sqrt(3)
    weights = softmax(scores)
    output = weights @ value
    print(output)
Sample Model
This program shows a simple attention mechanism calculation step-by-step using PyTorch. It prints the scores, weights, and final output vector.
    import torch
    import torch.nn.functional as F

    # Define query, key, value tensors
    query = torch.tensor([[1., 0., 1.]])               # shape (1, 3)
    key = torch.tensor([[1., 0., 0.], [0., 1., 0.]])   # shape (2, 3)
    value = torch.tensor([[1., 2.], [3., 4.]])         # shape (2, 2)

    d_k = query.size(-1)  # dimension of key vectors

    # Calculate attention scores
    scores = query @ key.T / (d_k ** 0.5)  # shape (1, 2)

    # Apply softmax to get attention weights
    weights = F.softmax(scores, dim=1)  # shape (1, 2)

    # Multiply weights by values to get output
    output = weights @ value  # shape (1, 2)

    print(f"Attention scores: {scores}")
    print(f"Attention weights: {weights}")
    print(f"Output: {output}")
Important Notes
Attention helps models decide what to focus on, improving understanding.
Softmax turns scores into probabilities that add up to 1.
Query, key, and value come from the input data or previous layers.
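As a rough sketch of how query, key, and value can come from the input, one common approach is to multiply the same input by three separate weight matrices. That approach is an assumption here; the notes above do not specify it. The names `W_q`, `W_k`, `W_v`, the sizes, and the random data are all hypothetical stand-ins for learned parameters and real embeddings.

```python
import numpy as np


def softmax(x):
    # Subtract the row-wise max for numerical stability before exponentiating.
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8                # hypothetical sizes
x = rng.standard_normal((seq_len, d_model))    # stand-in for token embeddings

# Hypothetical projection matrices (random here; learned in a real model).
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

query, key, value = x @ W_q, x @ W_k, x @ W_v

scores = query @ key.T / np.sqrt(d_k)   # shape (seq_len, seq_len)
weights = softmax(scores)               # each row is a probability distribution
output = weights @ value                # shape (seq_len, d_k)

print(weights.sum(axis=-1))  # every row sums to 1 (up to floating-point error)
print(output.shape)
```

Each row of `weights` tells one token how much to draw from every token in the sequence, which is the "what to focus on" decision described in the notes.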
Summary
Attention finds important parts of input to focus on.
It uses query, key, and value vectors to calculate weighted outputs.
Softmax makes scores into weights that sum to one.
Practice
1. What is the main purpose of the attention mechanism in NLP models?
easy
Solution
Step 1: Understand the role of attention
Attention helps the model decide which parts of the input are important to look at when making predictions.
Step 2: Compare options with the concept
Only "To focus on important parts of the input data" correctly describes this role.
Final Answer:
To focus on important parts of the input data -> Option B
Quick Check:
Attention = focus on important input
Hint: Attention means focusing on key input parts
Common Mistakes:
- Thinking attention increases input size
- Confusing attention with model depth
- Assuming attention shuffles data
2. Which of the following correctly represents the formula to compute attention weights using query (Q) and key (K) vectors?
easy
Solution
Step 1: Recall attention weight calculation
Attention weights are computed by taking the dot product of the query with each key, then applying softmax across the resulting scores. Scaled dot-product attention also divides the scores by sqrt(d_k) before the softmax, as in the Syntax section.
Step 2: Match formula to options
Softmax(Q x K^T) applies softmax to Q multiplied by the transpose of K, which is the correct form.
Final Answer:
Softmax(Q x K^T) -> Option D
Quick Check:
Attention weights = softmax(dot product)
Hint: Attention weights = softmax of query-key dot product
Common Mistakes:
- Adding Q and K instead of dot product
- Using ReLU or Sigmoid instead of softmax
- Ignoring transpose on key vector
3. Given query vector Q = [1, 0], key vectors K1 = [1, 0], K2 = [0, 1], and value vectors V1 = [10, 0], V2 = [0, 20], what is the attention output after applying softmax on Q·K^T and multiplying by values?
medium
Solution
Step 1: Calculate dot products Q·K1 and Q·K2
Q·K1 = 1*1 + 0*0 = 1; Q·K2 = 1*0 + 0*1 = 0.
Step 2: Apply softmax to [1, 0]
Softmax(1, 0) = [e^1/(e^1+e^0), e^0/(e^1+e^0)] ≈ [0.731, 0.269].
Step 3: Multiply weights by values and sum
Output = 0.731*[10, 0] + 0.269*[0, 20] = [7.31, 0] + [0, 5.38] = [7.31, 5.38].
Step 4: Match to options
The computed output matches the option [7.31, 5.38] (values are approximate).
Final Answer:
[7.31, 5.38] -> Option C
Quick Check:
Softmax weights x values = output
Hint: Softmax weights times values gives attention output
Common Mistakes:
- Skipping softmax normalization
- Multiplying query with values directly
- Ignoring vector multiplication order
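The arithmetic in question 3 can be checked with a few lines of NumPy. This sketch follows the question as stated, so it applies softmax directly to Q·K^T without the sqrt(d_k) scaling; stacking K1, K2 and V1, V2 as matrix rows is a layout choice made here for illustration.

```python
import numpy as np

Q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0], [0.0, 1.0]])     # rows are K1 and K2
V = np.array([[10.0, 0.0], [0.0, 20.0]])   # rows are V1 and V2

scores = Q @ K.T                                  # one score per key: [1, 0]
weights = np.exp(scores) / np.exp(scores).sum()   # softmax over the two scores
output = weights @ V                              # weighted sum of the value rows

print(np.round(weights, 3))
print(np.round(output, 2))
```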
4. Identify the error in this attention weight calculation code snippet:
    import numpy as np
    Q = np.array([1, 2])
    K = np.array([[1, 0], [0, 1]])
    scores = np.dot(Q, K)
    weights = np.exp(scores) / np.sum(np.exp(scores))
medium
Solution
Step 1: Check dot product dimensions
Q has shape (2,) and K has shape (2, 2), with one key per row. np.dot(Q, K) returns shape (2,), but it combines Q with the columns of K rather than with each key row. Because K is the identity matrix here, the numbers happen to match, which hides the bug.
Step 2: Correct dot product usage
The dot product should be np.dot(Q, K.T) so that each score is the dot product of Q with one key vector.
Final Answer:
Dot product should be between Q and K transpose -> Option A
Quick Check:
Dot product with K transpose needed
Hint: Dot product query with key transpose for scores
Common Mistakes:
- Using K instead of K transpose
- Miscomputing softmax manually
- Swapping Q and K incorrectly
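A corrected version of the snippet from question 4 might look like the sketch below. The matrix `K_demo` is a hypothetical, non-symmetric set of keys added only to show that `Q @ K` and `Q @ K.T` differ in general; it is not part of the original question. Subtracting the maximum score before exponentiating is an optional stability tweak and does not change the softmax result.

```python
import numpy as np

Q = np.array([1.0, 2.0])
K = np.array([[1.0, 0.0], [0.0, 1.0]])   # from the question; each row is a key

scores = Q @ K.T                          # dot product of Q with each key row
weights = np.exp(scores - scores.max())   # subtract max for numerical stability
weights = weights / weights.sum()         # softmax: weights sum to 1
print(weights)

# Hypothetical non-symmetric keys, to make the effect of the transpose visible.
K_demo = np.array([[1.0, 2.0], [0.0, 1.0]])
print(Q @ K_demo.T)   # one score per key row
print(Q @ K_demo)     # combines Q with columns instead, giving different numbers
```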
5. In a transformer model, why is scaling the dot product by the square root of the key dimension important before applying softmax?
hard
Solution
Step 1: Understand dot product scaling
Without scaling, large dot product values can make softmax outputs very close to 0 or 1, causing gradients to vanish during training.
Step 2: Purpose of scaling by sqrt of key dimension
Scaling reduces the magnitude of dot products, keeping softmax outputs more balanced and gradients healthy.
Final Answer:
To prevent large dot product values causing softmax to produce very small gradients -> Option A
Quick Check:
Scaling avoids gradient vanishing in softmax
Hint: Scale dot product to keep softmax gradients stable
Common Mistakes:
- Thinking scaling increases dot product values
- Believing scaling normalizes queries only
- Assuming scaling reduces keys processed
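The saturation effect behind question 5 can be seen numerically. The sketch below compares softmax on a set of raw scores with softmax on the same scores divided by sqrt(d_k), and reports the largest entry of the softmax Jacobian as a rough proxy for gradient size. The score values and d_k = 64 are hypothetical, picked only to make the contrast visible.

```python
import numpy as np


def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()


def softmax_jacobian(p):
    """Jacobian of softmax: d p_i / d x_j = p_i * (delta_ij - p_j)."""
    return np.diag(p) - np.outer(p, p)


d_k = 64                                        # hypothetical key dimension
raw_scores = np.array([16.0, 8.0, 0.0, -8.0])   # illustrative unscaled scores
scaled_scores = raw_scores / np.sqrt(d_k)       # becomes [2, 1, 0, -1]

max_grad = {}
for name, s in (("unscaled", raw_scores), ("scaled", scaled_scores)):
    p = softmax(s)
    max_grad[name] = np.abs(softmax_jacobian(p)).max()
    print(name, np.round(p, 4), max_grad[name])
```

With the unscaled scores nearly all the weight lands on one entry and the Jacobian entries are close to zero, while the scaled scores give a more balanced distribution with noticeably larger derivatives.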
