
The Attention Mechanism in NLP, in Depth

Introduction

Attention lets a model focus on the most relevant parts of its input when making predictions, weighting useful information more heavily than the rest. Typical use cases include:

Translating a sentence from one language to another, where some words depend on others far away.
Summarizing a long article by focusing on key sentences.
Answering questions by looking at relevant parts of a text.
Generating captions for images by focusing on important image regions.
Speech recognition, where certain parts of the audio carry more information about the spoken words.
Syntax
Attention(Q, K, V) = softmax((Q * K^T) / sqrt(d_k)) * V

Q (queries), K (keys), and V (values) are matrices derived from the input data; d_k is the dimensionality of the keys.

softmax normalizes scores to probabilities, highlighting important parts.
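To make the softmax step concrete, here is a small plain-Python sketch (the `softmax` helper below is illustrative, not a library function):

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# A raw score of 2.0 vs 1.0 becomes roughly a 73% / 27% split
print(softmax([2.0, 1.0]))
```

Higher scores get exponentially more weight, but every position keeps a nonzero share, and the weights always sum to 1.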

Examples
A simple NumPy example showing how a query is matched against the keys and the values are weighted accordingly.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

Q = np.array([[1., 0.]])
K = np.array([[1., 0.], [0., 1.]])
V = np.array([[1., 2.], [3., 4.]])

scores = Q @ K.T / np.sqrt(K.shape[-1])  # scale by sqrt(d_k)
weights = softmax(scores)
output = weights @ V  # weighted sum of values
PyTorch code to compute attention output with softmax weights.
import torch

Q = torch.tensor([[1., 0.]])
K = torch.tensor([[1., 0.], [0., 1.]])
V = torch.tensor([[1., 2.], [3., 4.]])

scores = torch.matmul(Q, K.T) / (K.size(-1) ** 0.5)  # scale by sqrt(d_k)
weights = torch.nn.functional.softmax(scores, dim=-1)
output = torch.matmul(weights, V)
print(output)
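As a sanity check, the manual computation can be compared against PyTorch's built-in fused implementation, `torch.nn.functional.scaled_dot_product_attention` (available in PyTorch 2.0 and later); both use the same 1/sqrt(d_k) scaling by default:

```python
import torch
import torch.nn.functional as F

Q = torch.tensor([[1., 0.]])
K = torch.tensor([[1., 0.], [0., 1.]])
V = torch.tensor([[1., 2.], [3., 4.]])

# Manual scaled dot-product attention, as above
scores = Q @ K.T / (K.size(-1) ** 0.5)
manual = F.softmax(scores, dim=-1) @ V

# Built-in version; add a batch dimension, then remove it again
fused = F.scaled_dot_product_attention(
    Q.unsqueeze(0), K.unsqueeze(0), V.unsqueeze(0)
).squeeze(0)

print(torch.allclose(manual, fused))  # True
```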
Sample Model

This program shows how attention scores are computed, normalized, and used to get a weighted sum of values. It uses simple tensors to demonstrate the core idea.

import torch
import torch.nn.functional as F

# Define Query, Key, Value tensors
Q = torch.tensor([[1., 0., 1.]])  # Query vector
K = torch.tensor([[1., 0., 1.], [0., 1., 0.], [1., 1., 0.]])  # Key vectors
V = torch.tensor([[1., 2.], [3., 4.], [5., 6.]])  # Value vectors

d_k = Q.size(-1)  # dimension of key

# Calculate scaled dot-product attention
scores = torch.matmul(Q, K.T) / (d_k ** 0.5)  # shape: (1, 3)
weights = F.softmax(scores, dim=-1)  # shape: (1, 3)
output = torch.matmul(weights, V)  # shape: (1, 2)

print(f"Scores: {scores}")
print(f"Weights (attention probabilities): {weights}")
print(f"Output (weighted sum of values): {output}")
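In practice, Q, K, and V are not given directly: in self-attention they are produced from the same input sequence by learned linear projections. A minimal sketch, where the 3-token input and the d_model = 4 embedding size are arbitrary illustrative choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

d_model = 4                      # toy embedding size (illustrative)
x = torch.randn(3, d_model)      # 3 tokens, each a d_model-dim embedding

# Learned projections derive Q, K, V from the same input
W_q, W_k, W_v = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))
Q, K, V = W_q(x), W_k(x), W_v(x)

scores = Q @ K.T / (d_model ** 0.5)  # (3, 3): every token attends to every token
weights = F.softmax(scores, dim=-1)  # each row sums to 1
output = weights @ V                 # (3, d_model)

print(weights)
print(output.shape)  # torch.Size([3, 4])
```

Each row of `weights` is one token's attention distribution over the whole sequence, so each output row is a context-aware mixture of all the value vectors.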
Important Notes

Attention scores measure how well each key matches the query.

Scaling by sqrt(d_k) keeps dot products from growing with the key dimension; unscaled scores push softmax into a near one-hot regime where gradients vanish and learning slows.

Softmax turns scores into probabilities that sum to 1.
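The effect of the sqrt(d_k) scaling can be seen directly. With a large key dimension, raw dot products grow with d_k, and softmax over unscaled scores collapses toward a one-hot distribution; a small sketch using random vectors (d_k = 256 and 8 keys are arbitrary choices):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 256
q = torch.randn(1, d_k)
K = torch.randn(8, d_k)

raw = q @ K.T                # score magnitudes grow with d_k
scaled = raw / (d_k ** 0.5)  # variance brought back to ~1

p_raw = F.softmax(raw, dim=-1)
p_scaled = F.softmax(scaled, dim=-1)

print(p_raw.max().item())     # typically near 1.0: almost one-hot
print(p_scaled.max().item())  # noticeably less peaked
```

A near one-hot distribution has almost-zero gradients for every non-maximal position, which is why the unscaled version trains poorly.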

Summary

Attention helps models focus on important parts of input data.

It uses queries, keys, and values to compute weighted sums.

Softmax normalizes scores to highlight relevant information.