PyTorch · ~12 mins

Transformer decoder in PyTorch - Model Pipeline Trace

Model Pipeline - Transformer decoder

The Transformer decoder combines the encoder's output with the tokens it has already produced to predict the next word in a sequence. This is how machines generate language step by step.
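The whole pipeline traced below maps onto PyTorch's built-in `nn.TransformerDecoder`. A minimal sketch, assuming the dimensions used throughout this trace (512 features, 8 attention heads):

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8  # feature size and head count from the trace

decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=n_heads, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

tgt = torch.randn(1, 5, d_model)     # decoder input: 1 sequence x 5 tokens x 512 features
memory = torch.randn(1, 7, d_model)  # encoder output: 1 sequence x 7 tokens x 512 features

# Causal mask so each position attends only to earlier positions
causal_mask = nn.Transformer.generate_square_subsequent_mask(5)

out = decoder(tgt, memory, tgt_mask=causal_mask)
print(out.shape)  # torch.Size([1, 5, 512])
```

The output keeps the decoder's input shape; the vocabulary projection in stage 7 is a separate linear layer.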

Data Flow - 7 Stages
Stage 1: Input tokens
  Tokenize sentence into word pieces
  Shape: 1 sequence x 5 tokens → 1 sequence x 5 tokens
  Example: [101, 2003, 1037, 2742, 102]

Stage 2: Embedding lookup
  Convert tokens to vectors
  Shape: 1 sequence x 5 tokens → 1 sequence x 5 tokens x 512 features
  Example: [[0.1, -0.2, ..., 0.05], ..., [0.3, 0.0, ..., -0.1]]

Stage 3: Positional encoding added
  Add position info to embeddings
  Shape: 1 sequence x 5 tokens x 512 features → 1 sequence x 5 tokens x 512 features
  Detail: embedding vector + position vector for each token

Stage 4: Masked multi-head self-attention
  Focus on previous tokens only
  Shape: 1 sequence x 5 tokens x 512 features → 1 sequence x 5 tokens x 512 features
  Detail: attention weights mask future tokens

Stage 5: Encoder-decoder attention
  Focus on encoder output relevant to decoder tokens
  Shape: 1 sequence x 5 tokens x 512 features (decoder) + 1 sequence x 7 tokens x 512 features (encoder output) → 1 sequence x 5 tokens x 512 features
  Detail: attention aligns decoder tokens with encoder info

Stage 6: Feed-forward network
  Apply two linear layers with ReLU
  Shape: 1 sequence x 5 tokens x 512 features → 1 sequence x 5 tokens x 512 features
  Detail: non-linear transformation of features

Stage 7: Output linear + softmax
  Project to vocabulary size and normalize
  Shape: 1 sequence x 5 tokens x 512 features → 1 sequence x 5 tokens x 10000 vocabulary
  Example: [[0.01, 0.05, ..., 0.0001], ..., [0.02, 0.03, ..., 0.001]]
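The seven stages can be sketched end to end in PyTorch. This is a minimal illustration, not a full model: the sinusoidal positional encoding and the single decoder layer standing in for stages 4-6 are assumptions, while the token IDs and shapes come from the trace above.

```python
import math
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 10000, 512, 5

# Stage 1: input tokens (example IDs from the trace)
tokens = torch.tensor([[101, 2003, 1037, 2742, 102]])         # 1 x 5

# Stage 2: embedding lookup
embed = nn.Embedding(vocab_size, d_model)
x = embed(tokens)                                             # 1 x 5 x 512

# Stage 3: sinusoidal positional encoding added to embeddings
pos = torch.arange(seq_len).unsqueeze(1)
div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(pos * div)
pe[:, 1::2] = torch.cos(pos * div)
x = x + pe                                                    # broadcasts over the batch

# Stages 4-6: masked self-attention, encoder-decoder attention, feed-forward
layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
memory = torch.randn(1, 7, d_model)                           # encoder output: 1 x 7 x 512
mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
h = layer(x, memory, tgt_mask=mask)                           # 1 x 5 x 512

# Stage 7: project to vocabulary size and normalize
logits = nn.Linear(d_model, vocab_size)(h)                    # 1 x 5 x 10000
probs = logits.softmax(dim=-1)
print(probs.shape)  # torch.Size([1, 5, 10000])
```

Each row of `probs` sums to 1, giving a probability distribution over the 10000-word vocabulary for every position.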
Training Trace - Epoch by Epoch
Loss by epoch
 1: 5.2  |*****
 2: 4.1  |****
 3: 3.3  |***
 4: 2.7  |**
 5: 2.2  |**
 6: 1.8  |*
 7: 1.5  |*
 8: 1.3  |*
 9: 1.1  |*
10: 0.95 |*
Epoch | Loss ↓ | Accuracy ↑ | Observation
  1   | 5.2    | 0.12       | High loss and low accuracy at start
  2   | 4.1    | 0.25       | Loss decreased, accuracy improved
  3   | 3.3    | 0.38       | Model learning meaningful patterns
  4   | 2.7    | 0.48       | Steady improvement in metrics
  5   | 2.2    | 0.57       | Model converging well
  6   | 1.8    | 0.65       | Good balance of loss and accuracy
  7   | 1.5    | 0.71       | Further refinement of predictions
  8   | 1.3    | 0.76       | Model nearing stable performance
  9   | 1.1    | 0.80       | Strong accuracy, low loss
 10   | 0.95   | 0.83       | Training converged well
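The loss values above come from repeating a training step like the following. This is a schematic sketch assuming teacher forcing (the model predicts token t+1 from tokens up to t); the dummy batch and encoder output are placeholders.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512

# Model pieces (names are illustrative)
embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=6)
head = nn.Linear(d_model, vocab_size)

params = list(embed.parameters()) + list(decoder.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (1, 6))    # dummy target sentence: 6 tokens
memory = torch.randn(1, 7, d_model)              # dummy encoder output: 1 x 7 x 512
inp, target = tokens[:, :-1], tokens[:, 1:]      # shift by one: predict the next token

mask = nn.Transformer.generate_square_subsequent_mask(inp.size(1))
logits = head(decoder(embed(inp), memory, tgt_mask=mask))    # 1 x 5 x 10000
loss = loss_fn(logits.reshape(-1, vocab_size), target.reshape(-1))
loss.backward()
opt.step()
# At random initialization, cross-entropy is typically near ln(10000) ≈ 9.2
# and falls as training progresses, as in the table above.
```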
Prediction Trace - 6 Layers
Layer 1: Input token embedding
Layer 2: Add positional encoding
Layer 3: Masked multi-head self-attention
Layer 4: Encoder-decoder attention
Layer 5: Feed-forward network
Layer 6: Output linear + softmax
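At prediction time these layers run once per generated token: each step feeds the tokens produced so far back through embedding, positional encoding, attention, and the output softmax. A greedy-decoding sketch, with an untrained model and an assumed start-token ID of 101:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512
embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=6)
head = nn.Linear(d_model, vocab_size)
memory = torch.randn(1, 7, d_model)          # encoder output for one source sentence

tokens = torch.tensor([[101]])               # start token (hypothetical ID)
for _ in range(4):                           # generate 4 more tokens
    mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
    h = decoder(embed(tokens), memory, tgt_mask=mask)
    next_id = head(h[:, -1]).argmax(dim=-1, keepdim=True)  # most probable next token
    tokens = torch.cat([tokens, next_id], dim=1)
print(tokens.shape)  # torch.Size([1, 5])
```

Greedy decoding picks the single most probable token at each step; beam search or sampling are common alternatives.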
Model Quiz - 3 Questions
Test your understanding
Why does the decoder use masked self-attention?
A) To prevent looking at future tokens during prediction
B) To speed up training by ignoring some tokens
C) To add position information to tokens
D) To combine encoder and decoder outputs
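The masking behind option A is easy to inspect directly: the causal mask is an upper-triangular matrix of -inf, so softmax assigns zero attention weight to every future position.

```python
import torch
import torch.nn as nn

# Causal mask for a 5-token sequence: position i may attend only to positions <= i
mask = nn.Transformer.generate_square_subsequent_mask(5)
print(mask)
# 0.0 on and below the diagonal (allowed), -inf above it (blocked)
```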
Key Insight
The Transformer decoder learns to predict the next word by focusing only on past words and relevant encoded information. Masking future tokens ensures predictions are made step-by-step, and training shows steady improvement in accuracy as loss decreases.