Computer Visionml~12 mins

Vision Transformer (ViT) in Computer Vision - Model Pipeline Trace

Choose your learning style10 modes available

Learn Why Deep Model Try Challenge Experiment Recall Metrics

Start learning this pattern below

Jump into concepts and practice - no test required

Recommended

Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong

Model Pipeline - Vision Transformer (ViT)

The Vision Transformer (ViT) model splits an image into small patches, turns them into a sequence, and uses a transformer to learn patterns for image classification.

Data Flow - 8 Stages

1Input Image

1 image x 224 height x 224 width x 3 channels→Original color image input→1 image x 224 x 224 x 3

A photo of a cat with RGB colors

↓

2Patch Extraction

1 x 224 x 224 x 3→Split image into 16x16 patches→1 x 196 patches x (16*16*3=768) features

Image split into 196 small square patches, each flattened to 768 numbers

↓

3Linear Projection

1 x 196 x 768→Project each patch to embedding space→1 x 196 x 768

Each patch converted to a 768-length vector

↓

4Add Position Embeddings

1 x 196 x 768→Add position info to each patch embedding→1 x 196 x 768

Patch vectors now include location info

↓

5Classification Token

1 x 196 x 768→Add special classification token to the sequence→1 x 197 x 768

Sequence with classification token prepended

↓

6Transformer Encoder

1 x 197 x 768→Process sequence with multi-head self-attention layers→1 x 197 x 768

Model learns relationships between patches and classification token

↓

7Classification Token Extraction

1 x 197 x 768→Extract classification token output for classification→1 x 768

Single vector representing whole image

↓

8MLP Head

1 x 768→Feedforward network to predict class probabilities→1 x 1000 (for ImageNet classes)

Output probabilities for 1000 classes

Training Trace - Epoch by Epoch


Loss
2.3 |*         
1.5 |  *       
0.9 |    *     
0.6 |      *   
0.45|       *  
    +----------
     1  5 10 15 20 Epochs

Epoch	Loss ↓	Accuracy ↑	Observation
1	2.30	0.12	Starting training, loss high, accuracy low
5	1.50	0.45	Model learning basic features, accuracy improving
10	0.90	0.70	Good progress, model captures complex patterns
15	0.60	0.82	Loss decreasing steadily, accuracy high
20	0.45	0.88	Training converging, model performs well

Prediction Trace - 5 Layers

Layer 1: Patch Extraction

Layer 2: Linear Projection + Position Embedding + Classification Token

Layer 3: Transformer Encoder

Layer 4: Classification Token Extraction

Layer 5: MLP Head

Model Quiz - 3 Questions

Test your understanding

What is the purpose of splitting the image into patches in ViT?

ATo reduce the image size for faster training

BTo increase the number of color channels

CTo convert the image into a sequence for the transformer

DTo remove noise from the image

Key Insight

Vision Transformer uses a sequence approach by splitting images into patches and applying transformer attention, enabling it to learn complex image patterns effectively, as shown by steady loss decrease and accuracy increase during training.

Practice

(1/5)

1. What is the main purpose of splitting an image into patches in a Vision Transformer (ViT)?

easy

A. To reduce the image size by cropping

B. To convert the image into smaller parts that the transformer can process as tokens

C. To apply convolution filters on each patch separately

D. To increase the image resolution for better detail

Vision Transformer (ViT) in Computer Vision - Model Pipeline Trace

Start learning this pattern below

Practice

Solution

Step 1: Understand ViT input processing

Step 2: Purpose of patch splitting

Final Answer:

Quick Check:

Solution

Step 1: Understand tensor concatenation dimension

Step 2: Correct concatenation syntax

Final Answer:

Quick Check:

Solution

Step 1: Calculate number of patches

Step 2: Determine patch_embeddings shape

Final Answer:

Quick Check:

Solution

Step 1: Check batch size compatibility

Step 2: Fix class_token shape

Final Answer:

Quick Check:

Solution

Step 1: Understand class token role

Step 2: Use in classification

Final Answer:

Quick Check: