
Vision Transformer (ViT) in Computer Vision - Model Pipeline Trace

Model Pipeline - Vision Transformer (ViT)

The Vision Transformer (ViT) model splits an image into small patches, turns them into a sequence, and uses a transformer to learn patterns for image classification.

Data Flow - 8 Stages
1. Input Image
   1 x 224 x 224 x 3 → 1 x 224 x 224 x 3. Original color image input: a photo of a cat with RGB colors.
2. Patch Extraction
   1 x 224 x 224 x 3 → 1 x 196 x 768. Split the image into 16x16 patches: (224/16)^2 = 14 x 14 = 196 patches, each flattened to 16*16*3 = 768 numbers.
3. Linear Projection
   1 x 196 x 768 → 1 x 196 x 768. Project each flattened patch into the embedding space, so each patch becomes a 768-length embedding vector.
4. Add Position Embeddings
   1 x 196 x 768 → 1 x 196 x 768. Add position information to each patch embedding; the patch vectors now include location info.
5. Classification Token
   1 x 196 x 768 → 1 x 197 x 768. Prepend a special classification ([CLS]) token to the sequence, growing it to 197 tokens.
6. Transformer Encoder
   1 x 197 x 768 → 1 x 197 x 768. Process the sequence with multi-head self-attention layers; the model learns relationships between patches and the classification token.
7. Classification Token Extraction
   1 x 197 x 768 → 1 x 768. Extract the classification token's output: a single vector representing the whole image.
8. MLP Head
   1 x 768 → 1 x 1000. A feedforward network maps the vector to class probabilities for the 1000 ImageNet classes.
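The eight stages above can be traced end to end with a minimal NumPy sketch that checks each shape transition. Random weights stand in for learned parameters, and the shape-preserving transformer encoder (stage 6) is omitted; this is a shape trace, not a working model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1: input image (batch x height x width x channels)
image = rng.standard_normal((1, 224, 224, 3))

# Stage 2: split into 16x16 patches -> 14*14 = 196 patches of 16*16*3 = 768 values
P = 16
patches = image.reshape(1, 224 // P, P, 224 // P, P, 3)      # (1, 14, 16, 14, 16, 3)
patches = patches.transpose(0, 1, 3, 2, 4, 5).reshape(1, 196, 768)

# Stage 3: linear projection to the embedding dimension (also 768, so shape is unchanged)
W_proj = rng.standard_normal((768, 768)) * 0.02
tokens = patches @ W_proj                                    # (1, 196, 768)

# Stage 4: add (here random, normally learned) position embeddings
pos_emb = rng.standard_normal((1, 196, 768)) * 0.02
tokens = tokens + pos_emb

# Stage 5: prepend the classification ([CLS]) token -> sequence length 197
cls_token = rng.standard_normal((1, 1, 768)) * 0.02
tokens = np.concatenate([cls_token, tokens], axis=1)         # (1, 197, 768)

# Stage 6 (transformer encoder) preserves the (1, 197, 768) shape; omitted here.

# Stage 7: take the classification token's output vector
cls_out = tokens[:, 0, :]                                    # (1, 768)

# Stage 8: MLP head -> logits for 1000 ImageNet classes
W_head = rng.standard_normal((768, 1000)) * 0.02
logits = cls_out @ W_head                                    # (1, 1000)

print(patches.shape, tokens.shape, cls_out.shape, logits.shape)
```

The reshape/transpose pair in stage 2 is the standard trick for patch extraction: it carves the image into a 14 x 14 grid of 16 x 16 x 3 blocks, then flattens each block into one 768-number row.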
Training Trace - Epoch by Epoch

Loss
2.30 |*
1.50 |     *
0.90 |          *
0.60 |               *
0.45 |                    *
     +----------------------
      1    5    10   15   20  Epochs
Epoch | Loss ↓ | Accuracy ↑ | Observation
------|--------|------------|--------------------------------------------------
1     | 2.30   | 0.12       | Starting training: loss high, accuracy low
5     | 1.50   | 0.45       | Model learning basic features, accuracy improving
10    | 0.90   | 0.70       | Good progress; model captures complex patterns
15    | 0.60   | 0.82       | Loss decreasing steadily, accuracy high
20    | 0.45   | 0.88       | Training converging; model performs well
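The convergence pattern in the trace can be checked programmatically. A small sketch over the tabulated values (these are the numbers from the table above, not real training output):

```python
# Epoch/loss/accuracy values copied from the training trace table
epochs = [1, 5, 10, 15, 20]
loss = [2.30, 1.50, 0.90, 0.60, 0.45]
acc = [0.12, 0.45, 0.70, 0.82, 0.88]

# Loss strictly decreases and accuracy strictly increases across the trace
assert all(a > b for a, b in zip(loss, loss[1:]))
assert all(a < b for a, b in zip(acc, acc[1:]))

# Loss drop per epoch between checkpoints: improvement slows as training converges
drops = [(a - b) / (e2 - e1)
         for (a, b, e1, e2) in zip(loss, loss[1:], epochs, epochs[1:])]
print([round(d, 3) for d in drops])   # -> [0.2, 0.12, 0.06, 0.03]
```

The halving of the per-epoch loss drop at each checkpoint is what "training converging" means concretely in the last table row.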
Prediction Trace - 5 Layers
Layer 1: Patch Extraction
Layer 2: Linear Projection + Position Embedding + Classification Token
Layer 3: Transformer Encoder
Layer 4: Classification Token Extraction
Layer 5: MLP Head
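Layer 3 (the transformer encoder) is the one shape-preserving step whose inner workings aren't spelled out above. A minimal NumPy sketch of a single multi-head self-attention pass with ViT-Base dimensions (random weights; a real encoder block would add layer norm, a residual connection, and an MLP):

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, D, H = 1, 197, 768, 12          # batch, tokens, embed dim, heads (ViT-Base values)
d = D // H                            # 64 dimensions per head

x = rng.standard_normal((B, N, D))    # stand-in for the 197-token sequence

# Project tokens to queries, keys, and values
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.02 for _ in range(3))

def split_heads(t):
    """(B, N, D) -> (B, H, N, d): one attention head per slice of the embedding."""
    return t.reshape(B, N, H, d).transpose(0, 2, 1, 3)

q, k, v = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)

# Scaled dot-product attention: every token attends to all 197 tokens
scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d)        # (B, H, N, N)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
out = (weights @ v).transpose(0, 2, 1, 3).reshape(B, N, D)

print(weights.shape, out.shape)   # attention is (1, 12, 197, 197); output back to (1, 197, 768)
```

The (197, 197) attention map is the mechanism behind "learns relationships between patches and classification token": each row, including the [CLS] token's, is a probability distribution over all 197 tokens.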
Model Quiz - 3 Questions
Test your understanding
What is the purpose of splitting the image into patches in ViT?
A. To reduce the image size for faster training
B. To increase the number of color channels
C. To convert the image into a sequence for the transformer
D. To remove noise from the image
Key Insight
The Vision Transformer treats an image as a sequence: it splits the image into patches and applies transformer attention across them. This lets it learn complex image patterns effectively, as reflected in the steady loss decrease and accuracy increase during training.