PyTorch · ~12 mins

Transformer decoder in PyTorch - Model Pipeline Trace

Model Pipeline - Transformer decoder

The Transformer decoder combines the encoder's output with the tokens it has already produced to predict the next word in a sequence. This is how machines generate language step by step.
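The whole pipeline traced below maps onto PyTorch's built-in `nn.TransformerDecoder`. A minimal sketch, assuming the dimensions used throughout this trace (512 features, 8 attention heads):

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8  # feature size and head count from the trace

decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=n_heads, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

tgt = torch.randn(1, 5, d_model)     # decoder input: 1 sequence x 5 tokens x 512 features
memory = torch.randn(1, 7, d_model)  # encoder output: 1 sequence x 7 tokens x 512 features

# Causal mask so each position attends only to earlier positions
causal_mask = nn.Transformer.generate_square_subsequent_mask(5)

out = decoder(tgt, memory, tgt_mask=causal_mask)
print(out.shape)  # torch.Size([1, 5, 512])
```

The output keeps the decoder's input shape; the vocabulary projection in stage 7 is a separate linear layer.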

Data Flow - 7 Stages
Stage 1: Input tokens
  Tokenize sentence into word pieces
  Shape: 1 sequence x 5 tokens → 1 sequence x 5 tokens
  Example: [101, 2003, 1037, 2742, 102]

Stage 2: Embedding lookup
  Convert tokens to vectors
  Shape: 1 sequence x 5 tokens → 1 sequence x 5 tokens x 512 features
  Example: [[0.1, -0.2, ..., 0.05], ..., [0.3, 0.0, ..., -0.1]]

Stage 3: Positional encoding added
  Add position info to embeddings
  Shape: 1 sequence x 5 tokens x 512 features → 1 sequence x 5 tokens x 512 features
  Detail: embedding vector + position vector for each token

Stage 4: Masked multi-head self-attention
  Focus on previous tokens only
  Shape: 1 sequence x 5 tokens x 512 features → 1 sequence x 5 tokens x 512 features
  Detail: attention weights mask future tokens

Stage 5: Encoder-decoder attention
  Focus on encoder output relevant to decoder tokens
  Shape: 1 sequence x 5 tokens x 512 features (decoder) + 1 sequence x 7 tokens x 512 features (encoder output) → 1 sequence x 5 tokens x 512 features
  Detail: attention aligns decoder tokens with encoder info

Stage 6: Feed-forward network
  Apply two linear layers with ReLU
  Shape: 1 sequence x 5 tokens x 512 features → 1 sequence x 5 tokens x 512 features
  Detail: non-linear transformation of features

Stage 7: Output linear + softmax
  Project to vocabulary size and normalize
  Shape: 1 sequence x 5 tokens x 512 features → 1 sequence x 5 tokens x 10000 vocabulary
  Example: [[0.01, 0.05, ..., 0.0001], ..., [0.02, 0.03, ..., 0.001]]
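The seven stages can be sketched end to end in PyTorch. This is a minimal illustration, not a full model: the sinusoidal positional encoding and the single decoder layer standing in for stages 4-6 are assumptions, while the token IDs and shapes come from the trace above.

```python
import math
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 10000, 512, 5

# Stage 1: input tokens (example IDs from the trace)
tokens = torch.tensor([[101, 2003, 1037, 2742, 102]])         # 1 x 5

# Stage 2: embedding lookup
embed = nn.Embedding(vocab_size, d_model)
x = embed(tokens)                                             # 1 x 5 x 512

# Stage 3: sinusoidal positional encoding added to embeddings
pos = torch.arange(seq_len).unsqueeze(1)
div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(pos * div)
pe[:, 1::2] = torch.cos(pos * div)
x = x + pe                                                    # broadcasts over the batch

# Stages 4-6: masked self-attention, encoder-decoder attention, feed-forward
layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
memory = torch.randn(1, 7, d_model)                           # encoder output: 1 x 7 x 512
mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
h = layer(x, memory, tgt_mask=mask)                           # 1 x 5 x 512

# Stage 7: project to vocabulary size and normalize
logits = nn.Linear(d_model, vocab_size)(h)                    # 1 x 5 x 10000
probs = logits.softmax(dim=-1)
print(probs.shape)  # torch.Size([1, 5, 10000])
```

Each row of `probs` sums to 1, giving a probability distribution over the 10000-word vocabulary for every position.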
Training Trace - Epoch by Epoch
Loss by epoch
 1: 5.2  |*****
 2: 4.1  |****
 3: 3.3  |***
 4: 2.7  |**
 5: 2.2  |**
 6: 1.8  |*
 7: 1.5  |*
 8: 1.3  |*
 9: 1.1  |*
10: 0.95 |*
Epoch | Loss ↓ | Accuracy ↑ | Observation
  1   | 5.2    | 0.12       | High loss and low accuracy at start
  2   | 4.1    | 0.25       | Loss decreased, accuracy improved
  3   | 3.3    | 0.38       | Model learning meaningful patterns
  4   | 2.7    | 0.48       | Steady improvement in metrics
  5   | 2.2    | 0.57       | Model converging well
  6   | 1.8    | 0.65       | Good balance of loss and accuracy
  7   | 1.5    | 0.71       | Further refinement of predictions
  8   | 1.3    | 0.76       | Model nearing stable performance
  9   | 1.1    | 0.80       | Strong accuracy, low loss
 10   | 0.95   | 0.83       | Training converged well
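The loss values above come from repeating a training step like the following. This is a schematic sketch assuming teacher forcing (the model predicts token t+1 from tokens up to t); the dummy batch and encoder output are placeholders.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512

# Model pieces (names are illustrative)
embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=6)
head = nn.Linear(d_model, vocab_size)

params = list(embed.parameters()) + list(decoder.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (1, 6))    # dummy target sentence: 6 tokens
memory = torch.randn(1, 7, d_model)              # dummy encoder output: 1 x 7 x 512
inp, target = tokens[:, :-1], tokens[:, 1:]      # shift by one: predict the next token

mask = nn.Transformer.generate_square_subsequent_mask(inp.size(1))
logits = head(decoder(embed(inp), memory, tgt_mask=mask))    # 1 x 5 x 10000
loss = loss_fn(logits.reshape(-1, vocab_size), target.reshape(-1))
loss.backward()
opt.step()
# At random initialization, cross-entropy is typically near ln(10000) ≈ 9.2
# and falls as training progresses, as in the table above.
```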
Prediction Trace - 6 Layers
Layer 1: Input token embedding
Layer 2: Add positional encoding
Layer 3: Masked multi-head self-attention
Layer 4: Encoder-decoder attention
Layer 5: Feed-forward network
Layer 6: Output linear + softmax
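At prediction time these layers run once per generated token: each step feeds the tokens produced so far back through embedding, positional encoding, attention, and the output softmax. A greedy-decoding sketch, with an untrained model and an assumed start-token ID of 101:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512
embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=6)
head = nn.Linear(d_model, vocab_size)
memory = torch.randn(1, 7, d_model)          # encoder output for one source sentence

tokens = torch.tensor([[101]])               # start token (hypothetical ID)
for _ in range(4):                           # generate 4 more tokens
    mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
    h = decoder(embed(tokens), memory, tgt_mask=mask)
    next_id = head(h[:, -1]).argmax(dim=-1, keepdim=True)  # most probable next token
    tokens = torch.cat([tokens, next_id], dim=1)
print(tokens.shape)  # torch.Size([1, 5])
```

Greedy decoding picks the single most probable token at each step; beam search or sampling are common alternatives.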
Model Quiz - 3 Questions
Test your understanding
Why does the decoder use masked self-attention?
A) To prevent looking at future tokens during prediction
B) To speed up training by ignoring some tokens
C) To add position information to tokens
D) To combine encoder and decoder outputs
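The masking behind option A is easy to inspect directly: the causal mask is an upper-triangular matrix of -inf, so softmax assigns zero attention weight to every future position.

```python
import torch
import torch.nn as nn

# Causal mask for a 5-token sequence: position i may attend only to positions <= i
mask = nn.Transformer.generate_square_subsequent_mask(5)
print(mask)
# 0.0 on and below the diagonal (allowed), -inf above it (blocked)
```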
Key Insight
The Transformer decoder learns to predict the next word by focusing only on past words and relevant encoded information. Masking future tokens ensures predictions are made step-by-step, and training shows steady improvement in accuracy as loss decreases.