Attention mechanism basics in NLP

Introduction

Attention lets a model give more weight to the parts of its input that matter most for the current prediction, much as a reader concentrates on the key words of a sentence to grasp its meaning.

Typical tasks that benefit from attention:

Translating a sentence from one language to another

Answering questions based on a paragraph of text

Summarizing a long article into a short summary

Recognizing objects in an image by focusing on important areas

Syntax

NLP

attention_scores = query @ key.T / sqrt(d_k)
attention_weights = softmax(attention_scores)
output = attention_weights @ value

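query, key, and value are vectors or matrices derived from the input. Each query is compared with every key, and the resulting weights are used to mix the corresponding rows of value.

d_k is the dimension of the key vectors. Dividing by sqrt(d_k) keeps the scores from growing with the vector dimension, which would otherwise push softmax toward nearly one-hot weights and very small gradients.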
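A rough way to see the effect of the scaling is to compare the spread of raw and scaled dot products for random vectors. The sketch below is an illustration only: the helper name dot_product_variance, the standard normal components, and the trial count are assumptions made for this demo and are not part of the formula above. It uses only the Python standard library.

```python
import math
import random
import statistics

def dot_product_variance(d_k, trials=2000, seed=0):
    """Return (raw variance, scaled variance) of q.k for random vectors.

    Hypothetical helper for illustration: q and k have independent
    standard normal components, which is an assumption of this demo.
    """
    rng = random.Random(seed)
    raw = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        raw.append(sum(qi * ki for qi, ki in zip(q, k)))
    # Same scores after dividing by sqrt(d_k), as in the attention formula
    scaled = [s / math.sqrt(d_k) for s in raw]
    return statistics.pvariance(raw), statistics.pvariance(scaled)

for d_k in (4, 64):
    raw_var, scaled_var = dot_product_variance(d_k)
    print(f"d_k={d_k:3d}  raw variance={raw_var:7.2f}  scaled variance={scaled_var:5.2f}")
```

Under these assumptions the raw dot product has a variance close to d_k, while the scaled version stays close to 1 whatever the dimension, so the softmax inputs remain in a similar range as vectors get longer.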

Examples

This example shows how to calculate attention weights and output using simple vectors.

NLP

import torch
import torch.nn.functional as F

query = torch.tensor([[1., 0., 1.]])
key = torch.tensor([[1., 0., 0.], [0., 1., 0.]])
value = torch.tensor([[1., 2.], [3., 4.]])

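d_k = key.size(-1)  # dimension of each key vector (3 here)
scores = query @ key.T / (d_k ** 0.5)  # scaled dot products, shape (1, 2)
weights = F.softmax(scores, dim=-1)  # each row of weights sums to 1
output = weights @ value  # weighted sum of the value rows, shape (1, 2)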
print(output)

This example performs the same attention calculation with NumPy instead of PyTorch.

NLP

import numpy as np

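def softmax(x):
    # Subtract the row-wise max before exponentiating for numerical stability
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

query = np.array([[1., 0., 1.]])  # shape (1, 3)
key = np.array([[1., 0., 0.], [0., 1., 0.]])  # shape (2, 3)
value = np.array([[1., 2.], [3., 4.]])  # shape (2, 2)

d_k = key.shape[-1]  # dimension of each key vector
scores = query @ key.T / np.sqrt(d_k)  # shape (1, 2)
weights = softmax(scores)  # each row sums to 1
output = weights @ value  # shape (1, 2)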
print(output)
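The same steps can be wrapped in a function that handles several queries at once, producing one row of weights per query. This is a minimal plain-Python sketch (no NumPy or PyTorch) so that each arithmetic step stays visible. The function names attention and softmax and the second query [0, 1, 0] are illustrative additions, not taken from the examples above.

```python
import math

def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for lists of query, key and value rows."""
    d_k = len(keys[0])  # dimension of each key vector
    outputs, all_weights = [], []
    for q in queries:
        # One scaled score per key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value rows, column by column
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        all_weights.append(weights)
        outputs.append(out)
    return outputs, all_weights

queries = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # second query is hypothetical
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0]]

outputs, weights = attention(queries, keys, values)
for q, w, o in zip(queries, weights, outputs):
    print(q, [round(x, 4) for x in w], [round(x, 4) for x in o])
```

The first query reproduces the result of the examples above (about [1.7191, 2.7191]). The second query matches the second key more strongly, so its output moves toward the second value row (about [2.2809, 3.2809]).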

Sample Model

This program shows a simple attention mechanism calculation step-by-step using PyTorch. It prints the scores, weights, and final output vector.

NLP

import torch
import torch.nn.functional as F

# Define query, key, value tensors
query = torch.tensor([[1., 0., 1.]])  # shape (1, 3)
key = torch.tensor([[1., 0., 0.], [0., 1., 0.]])  # shape (2, 3)
value = torch.tensor([[1., 2.], [3., 4.]])  # shape (2, 2)

d_k = key.size(-1)  # dimension of key vectors

# Calculate attention scores
scores = query @ key.T / (d_k ** 0.5)  # shape (1, 2)

# Apply softmax to get attention weights
weights = F.softmax(scores, dim=1)  # shape (1, 2)

# Multiply weights by values to get output
output = weights @ value  # shape (1, 2)

print(f"Attention scores: {scores}")
print(f"Attention weights: {weights}")
print(f"Output: {output}")

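Expected output (values shown to four decimals):

Attention scores: tensor([[0.5774, 0.0000]])
Attention weights: tensor([[0.6405, 0.3595]])
Output: tensor([[1.7191, 2.7191]])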

Important Notes

Attention lets a model weight each input position by its relevance to the current query instead of treating all positions equally.

Softmax turns scores into probabilities that add up to 1.

Query, key, and value are computed from the input data or from the outputs of previous layers, typically through learned linear projections.
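The softmax note above can be checked with a short sketch. It is a plain-Python illustration under simple assumptions: the softmax helper and the three score lists are made up for the demo and do not come from any library.

```python
import math

def softmax(scores):
    """Convert raw scores into non-negative weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative score lists: a small gap, a large gap, and a tie
for scores in ([1.0, 0.0], [5.0, 0.0], [2.0, 2.0, 2.0]):
    weights = softmax(scores)
    print(scores, [round(w, 4) for w in weights], "sum =", round(sum(weights), 6))
```

Every row sums to 1. A larger gap between scores concentrates more weight on the top score (about 0.73 for a gap of 1 and about 0.99 for a gap of 5), and equal scores give equal weights.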

Summary

Attention finds important parts of input to focus on.

It uses query, key, and value vectors to calculate weighted outputs.

Softmax converts scores into weights that sum to one.

Practice

1. (easy) What is the main purpose of the attention mechanism in NLP models?

A. To reduce the number of layers in the model

B. To focus on important parts of the input data

C. To increase the size of the input data

D. To randomly shuffle the input tokens
