Attention helps a model focus on the most relevant parts of its input when making decisions, weighing useful information more heavily than the rest.
The Attention Mechanism in NLP: An In-Depth Look
Introduction
Attention is useful wherever a model must weigh some parts of the input more heavily than others, for example:
Translating a sentence from one language to another, where some words depend on others far away.
Summarizing a long article by focusing on key sentences.
Answering questions by looking at relevant parts of a text.
Generating captions for images by focusing on important image regions.
Speech recognition, where certain sounds matter more than others for recognizing words.
Syntax
Attention(Q, K, V) = softmax((Q * K^T) / sqrt(d_k)) * V
Q (query), K (key), and V (value) are matrices derived from the input data; d_k is the dimensionality of the keys.
The softmax normalizes the scores into probabilities, so the most relevant positions receive the largest weights.
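To make the normalization step concrete, here is a hand-rolled softmax in plain Python; the score values are made up purely for illustration:

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax([2.0, 1.0, 0.1])
print(weights)        # larger scores receive larger weights
print(sum(weights))   # the weights sum to 1
```

Note that softmax is order-preserving: the highest score always gets the highest weight, but every position keeps a nonzero share of attention.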
Examples
A simple example showing how a query is matched against the keys, and how the resulting weights combine the values.
```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

Q = np.array([[1., 0.]])
K = np.array([[1., 0.], [0., 1.]])
V = np.array([[1., 2.], [3., 4.]])

scores = Q @ K.T / (2 ** 0.5)   # d_k = 2
weights = softmax(scores)
output = weights @ V
```
PyTorch code to compute attention output with softmax weights.
```python
import torch

Q = torch.tensor([[1., 0.]])
K = torch.tensor([[1., 0.], [0., 1.]])
V = torch.tensor([[1., 2.], [3., 4.]])

scores = torch.matmul(Q, K.T) / (2 ** 0.5)
weights = torch.nn.functional.softmax(scores, dim=-1)
output = torch.matmul(weights, V)
print(output)
```
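In practice you rarely need to write the formula by hand: PyTorch 2.0 and later ship torch.nn.functional.scaled_dot_product_attention, a fused implementation of the same computation (with optional masking and dropout). A minimal sketch, assuming a PyTorch 2.0+ install; the tensors mirror the example above, with a leading batch dimension added:

```python
import torch
import torch.nn.functional as F

# Same tensors as above, with a leading batch dimension,
# which scaled_dot_product_attention expects.
Q = torch.tensor([[[1., 0.]]])              # (batch=1, seq=1, d_k=2)
K = torch.tensor([[[1., 0.], [0., 1.]]])    # (1, 2, 2)
V = torch.tensor([[[1., 2.], [3., 4.]]])    # (1, 2, 2)

# Applies softmax(Q K^T / sqrt(d_k)) V internally
out = F.scaled_dot_product_attention(Q, K, V)
print(out)
```

The fused kernel is both faster and more memory-efficient than the manual version, which is why production code prefers it.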
Sample Model
This program shows how attention scores are computed, normalized, and used to get a weighted sum of values. It uses simple tensors to demonstrate the core idea.
```python
import torch
import torch.nn.functional as F

# Define Query, Key, Value tensors
Q = torch.tensor([[1., 0., 1.]])                               # Query vector
K = torch.tensor([[1., 0., 1.], [0., 1., 0.], [1., 1., 0.]])   # Key vectors
V = torch.tensor([[1., 2.], [3., 4.], [5., 6.]])               # Value vectors
d_k = Q.size(-1)  # dimension of key

# Calculate scaled dot-product attention
scores = torch.matmul(Q, K.T) / (d_k ** 0.5)  # shape: (1, 3)
weights = F.softmax(scores, dim=-1)           # shape: (1, 3)
output = torch.matmul(weights, V)             # shape: (1, 2)

print(f"Scores: {scores}")
print(f"Weights (attention probabilities): {weights}")
print(f"Output (weighted sum of values): {output}")
```
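In the examples above, Q, K, and V are fixed tensors for clarity. In a real model they are produced by learned linear projections of the same input sequence (self-attention). A minimal sketch of that setup; the projection names W_q, W_k, W_v and the sizes chosen here are illustrative, not part of the original example:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

d_model = 4   # embedding size (chosen for illustration)
seq_len = 3   # number of tokens

# Q, K, V are learned linear projections of the same input (self-attention)
x = torch.randn(seq_len, d_model)              # token embeddings
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

Q, K, V = W_q(x), W_k(x), W_v(x)
scores = Q @ K.T / (d_model ** 0.5)
weights = F.softmax(scores, dim=-1)
output = weights @ V
print(output.shape)  # each token gets a context-aware vector: (3, 4)
```

Because the projection weights are trainable, the model learns which tokens should attend to which during training.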
Important Notes
Attention scores measure how well each key matches the query.
Scaling by sqrt(d_k) keeps dot products from growing so large that the softmax saturates and gradients vanish.
Softmax turns scores into probabilities that sum to 1.
Summary
Attention helps models focus on important parts of input data.
It uses queries, keys, and values to compute weighted sums.
Softmax normalizes scores to highlight relevant information.