Self-attention and multi-head attention in NLP - Model Pipeline Trace

Model Pipeline - Self-attention and multi-head attention

This pipeline shows how self-attention and multi-head attention help a model capture relationships between words in a sentence. It converts input words into embedding vectors, learns attention patterns during training, and then produces context-aware representations for each word.

Data Flow - 5 Stages

1. Input tokens

1 sentence x 6 words → Convert words to token IDs → 1 sentence x 6 tokens

["The", "cat", "sat", "on", "the", "mat"] -> [101, 4937, 1037, 2006, 101, 3899] (illustrative IDs; actual values depend on the tokenizer's vocabulary)
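
The word-to-ID step can be sketched with a toy vocabulary. The `vocab` dictionary, the `UNK_ID` fallback, and the `encode` helper are hypothetical names introduced for illustration, and the IDs simply mirror the example above; a real tokenizer assigns IDs from its own vocabulary.

```python
# Hypothetical toy vocabulary; the IDs mirror the example above and are
# illustrative, not taken from any real tokenizer.
vocab = {"the": 101, "cat": 4937, "sat": 1037, "on": 2006, "mat": 3899}
UNK_ID = 0  # assumed ID for words missing from the vocabulary


def encode(words):
    """Lowercase each word and look up its integer ID."""
    return [vocab.get(w.lower(), UNK_ID) for w in words]


token_ids = encode(["The", "cat", "sat", "on", "the", "mat"])
print(token_ids)  # [101, 4937, 1037, 2006, 101, 3899]
```

Lowercasing before the lookup is why "The" and "the" share one ID in this sketch.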

↓

2. Embedding lookup

1 sentence x 6 tokens → Map tokens to vectors → 1 sentence x 6 tokens x 64 features

[101, 4937, 1037, 2006, 101, 3899] -> [[0.1,0.3,...], [0.2,0.1,...], ...]
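
As a minimal sketch, the lookup amounts to indexing a table with one 64-dimensional row per token ID. The `embedding_table` name and the random initial values are assumptions for illustration; in a trained model the rows are learned parameters.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

EMBED_DIM = 64  # features per token, as in the stage above
token_ids = [101, 4937, 1037, 2006, 101, 3899]

# Assumption: one randomly initialised 64-dimensional row per distinct ID.
# In a trained model these rows are learned parameters.
embedding_table = {
    tid: [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for tid in set(token_ids)
}

# The lookup itself is plain indexing: one vector per token position.
embedded = [embedding_table[tid] for tid in token_ids]

print(len(embedded), len(embedded[0]))  # 6 64
```

Both occurrences of ID 101 receive the same vector at this stage; the surrounding context only enters in the attention stages that follow.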

↓

3. Self-attention calculation

1 sentence x 6 tokens x 64 features → Compute attention scores between all tokens → 1 sentence x 6 tokens x 64 features

Each token attends to every token in the sentence, itself included; for example, 'cat' may place most of its attention weight on 'sat' and 'mat'
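
The attention step can be sketched as scaled dot-product self-attention in plain Python. The helper names (`matmul`, `softmax`, `self_attention`, `random_matrix`) and the random projection matrices `w_q`, `w_k`, `w_v` are assumptions for illustration; a trained model would learn these matrices.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def matmul(a, b):
    """Multiply a (n x k) by b (k x m); matrices are lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over token vectors x (n x d)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]        # transpose K to d_k x n
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]  # n x n, each row sums to 1
    return matmul(weights, v), weights          # output is n x d


def random_matrix(rows, cols):
    """Hypothetical stand-in for learned parameters or embeddings."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


n_tokens, d_model = 6, 64
x = random_matrix(n_tokens, d_model)            # stand-in for the embeddings
w_q, w_k, w_v = (random_matrix(d_model, d_model) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(len(output), len(output[0]))              # 6 64
```

Row `i` of `weights` holds how much token `i` attends to each of the 6 tokens, so with trained projections the row for 'cat' would show its weight on 'sat' and 'mat'.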

↓

4. Multi-head attention

1 sentence x 6 tokens x 64 features → Split features into 4 heads, apply self-attention separately, then combine → 1 sentence x 6 tokens x 64 features

4 heads, each with 16 features (64 / 4), concatenated back to 64 features
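
The split, attend, and combine steps can be sketched as follows. This assumes the common arrangement of projecting to queries, keys, and values first, slicing each into 4 heads of 16 features, and applying an output projection `w_o` after concatenation. The output projection and all helper names are assumptions not stated in the stage description, and the weights are random rather than learned.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def matmul(a, b):
    """Multiply a (n x k) by b (k x m); matrices are lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)


def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Project, split into heads, attend per head, concatenate, project."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_head = len(q[0]) // num_heads  # 64 // 4 = 16 features per head
    heads = []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        heads.append(attention([r[lo:hi] for r in q],
                               [r[lo:hi] for r in k],
                               [r[lo:hi] for r in v]))
    # Concatenate the heads per token: 4 x 16 features -> 64 features.
    concat = [[val for head in heads for val in head[i]] for i in range(len(x))]
    return matmul(concat, w_o)


def random_matrix(rows, cols):
    """Hypothetical stand-in for learned parameters or embeddings."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


n_tokens, d_model, num_heads = 6, 64, 4
x = random_matrix(n_tokens, d_model)
w_q, w_k, w_v, w_o = (random_matrix(d_model, d_model) for _ in range(4))

out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(len(out), len(out[0]))  # 6 64
```

Because each head computes its own 6 x 6 attention weights over a separate 16-feature slice, different heads are free to focus on different word relationships.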

↓

5. Output representation

1 sentence x 6 tokens x 64 features → Final context-aware token vectors → 1 sentence x 6 tokens x 64 features

Each token vector now includes information from the related words it attended to

Training Trace - Epoch by Epoch


Loss
1.2 |*       
1.0 | **     
0.8 |  ***   
0.6 |   **** 
0.4 |    *****
     --------
     Epochs

Epoch	Loss ↓	Accuracy ↑	Observation
1	1.2	0.45	Model starts learning basic word relationships
2	0.9	0.60	Attention weights improve, capturing more context
3	0.7	0.72	Multi-head attention helps model focus on different word aspects
4	0.5	0.80	Loss decreases steadily, accuracy rises as context understanding improves
5	0.4	0.85	Model converges with strong attention patterns

Prediction Trace - 5 Layers

Layer 1: Embedding lookup

Layer 2: Self-attention scores

Layer 3: Multi-head attention

Layer 4: Weighted sum of values

Layer 5: Final output
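
Layers 2 through 4 correspond to the standard scaled dot-product attention formulation, written out here as a sketch. The symbols are introduced for illustration and do not appear in the trace itself: Q, K, and V denote the query, key, and value projections of the embeddings X, d_k is the per-head feature size (16 in this pipeline), and W^O is an assumed output projection applied after the heads are concatenated.

```latex
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
\[
\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_4)\, W^{O},
\qquad \mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i)
\]
```

The softmax term gives the self-attention scores of Layer 2, and multiplying by V is the weighted sum of values in Layer 4.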

Model Quiz - 3 Questions

Test your understanding

What does self-attention help the model do?

A. Translate words into another language

B. Understand relationships between all words in a sentence

C. Reduce the number of words in a sentence

D. Generate random word sequences

Key Insight

Self-attention allows the model to weigh the importance of each word relative to others, capturing context effectively. Multi-head attention enhances this by learning multiple perspectives at once, improving understanding and prediction quality.