Vision Transformer (ViT) in Computer Vision - Model Metrics & Evaluation
For Vision Transformers used in image classification, accuracy is the main metric. It is the fraction of all images that the model labels correctly. However, when classes are imbalanced or some mistakes cost more than others, precision, recall, and F1 score become important. These metrics show whether the model is good at finding a given class (recall) and at avoiding wrong guesses for it (precision).
| | Predicted Cat | Predicted Dog |
|---|---|---|
| Actual Cat | 45 (true positive) | 5 (false negative) |
| Actual Dog | 3 (false positive) | 47 (true negative) |
Total samples = 45 + 5 + 3 + 47 = 100
Precision (Cat) = TP / (TP + FP) = 45 / (45 + 3) = 0.9375
Recall (Cat) = TP / (TP + FN) = 45 / (45 + 5) = 0.9
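The same arithmetic can be checked with a few lines of Python. This is a minimal sketch that treats Cat as the positive class and uses only the counts from the confusion matrix above; the variable names are illustrative, and accuracy and F1 are added alongside precision and recall.

```python
# Counts from the cat/dog confusion matrix, with Cat as the positive class.
tp = 45  # actual cat, predicted cat
fn = 5   # actual cat, predicted dog
fp = 3   # actual dog, predicted cat
tn = 47  # actual dog, predicted dog

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.9200 precision=0.9375 recall=0.9000 f1=0.9184
```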
Imagine ViT is used to detect rare animals in photos. If we want to be sure when the model says "animal found," we need high precision. This means fewer false alarms. But if missing any rare animal is bad, we want high recall to catch as many as possible, even if some guesses are wrong.
Choosing between precision and recall depends on the task. For example, in medical image analysis, missing a disease (low recall) is worse than a false alarm (low precision). For general object recognition, balanced metrics like F1 score help.
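Good: As a rough rule of thumb, accuracy above 85% on a balanced dataset suggests the ViT is learning well, though the right threshold depends on the task and the number of classes. Precision and recall above 80% show it finds and labels classes reliably.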
Bad: Accuracy near random chance (e.g., 10% for 10 classes) means the model is not learning. Very high accuracy but low recall means it misses many true cases. Low precision means many wrong guesses.
- Accuracy paradox: High accuracy can hide poor performance if classes are imbalanced.
- Data leakage: If test images are too similar to (or duplicated in) the training images, metrics look better than they should and the model won't generalize.
- Overfitting: Very high training accuracy but low test accuracy means the model is memorizing training images instead of learning general patterns.
Your ViT model has 98% accuracy but only 12% recall on a rare class. Is it good for production? Why or why not?
Answer: No, it is not good. The model misses most true cases of the rare class (low recall), which can be critical depending on the task. High accuracy is misleading because the rare class is small compared to others.
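To see how 98% accuracy and 12% recall can coexist, here is a small sketch with made-up counts. The dataset size (1100 images), the rare-class size (25 images), and the assumption that every common-class image is classified correctly are all hypothetical, chosen only to reproduce the numbers in the question.

```python
# Hypothetical counts (not from a real dataset) that reproduce the scenario.
rare_total = 25          # images that truly belong to the rare class
common_total = 1075      # all other images
rare_found = 3           # rare images the model labels correctly
rare_missed = rare_total - rare_found
common_correct = 1075    # assumption: every common image is labeled correctly

accuracy = (rare_found + common_correct) / (rare_total + common_total)
rare_recall = rare_found / rare_total

print(f"accuracy={accuracy:.2%}, rare-class recall={rare_recall:.2%}")
# accuracy=98.00%, rare-class recall=12.00%
```

The 22 missed rare images barely move overall accuracy because the rare class makes up just over 2% of the data.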
Practice
Solution
Step 1: Understand ViT input processing
ViT splits images into fixed-size patches to treat each patch like a word token in language models.
Step 2: Purpose of patch splitting
This allows the transformer to process image patches as a sequence, enabling attention mechanisms to learn relationships.
Final Answer:
To convert the image into smaller parts that the transformer can process as tokens -> Option B
Quick Check:
Image patches = tokens for transformer [OK]
- Confusing patch splitting with image resizing
- Thinking patches are processed by convolution
- Assuming patches increase image resolution
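The patch-splitting step described in this solution can be sketched in PyTorch with a reshape and a permute. The helper name `image_to_patches`, the 3-channel input, and the 32x32 image size are assumptions for illustration; real ViT implementations may organize this step differently.

```python
import torch

def image_to_patches(images: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Split (batch, channels, height, width) images into flattened patches.

    Returns a tensor of shape (batch, num_patches, channels * patch_size**2).
    """
    b, c, h, w = images.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size  # patch grid height and width
    x = images.reshape(b, c, gh, patch_size, gw, patch_size)
    x = x.permute(0, 2, 4, 1, 3, 5)            # (b, gh, gw, c, p, p)
    return x.reshape(b, gh * gw, c * patch_size * patch_size)

images = torch.randn(8, 3, 32, 32)             # assumed RGB input
patches = image_to_patches(images, patch_size=8)
print(patches.shape)  # torch.Size([8, 16, 192])
```

The image resolution is unchanged: the 16 patches of 8x8 pixels together cover exactly the original 32x32 image.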
Solution
Step 1: Understand tensor concatenation dimension
Patch embeddings are sequences along dimension 1 (batch, seq, embed); class token must be prepended along this dimension.
Step 2: Correct concatenation syntax
Using torch.cat with dim=1 adds class_token at the start of the sequence correctly.
Final Answer:
patches = torch.cat([class_token, patches], dim=1) -> Option A
Quick Check:
Class token prepended along the sequence dimension: patches = torch.cat([class_token, patches], dim=1) [OK]
- Concatenating along wrong dimension (dim=0)
- Appending class token at the end instead of start
- Mixing order of tensors in concat
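A quick shape check makes the choice of dimension concrete. This sketch uses random tensors with assumed sizes (batch 8, 16 patches, embedding 64) and also shows that concatenating along `dim=0` fails here, because the sequence lengths (1 and 16) differ.

```python
import torch

class_token = torch.randn(8, 1, 64)   # one class token per image in the batch
patches = torch.randn(8, 16, 64)      # (batch, seq, embed)

# Prepend along the sequence dimension: sequence length grows from 16 to 17.
seq = torch.cat([class_token, patches], dim=1)
print(seq.shape)  # torch.Size([8, 17, 64])

# The class token occupies position 0 of every sequence.
cls_is_first = torch.equal(seq[:, 0:1], class_token)

# Concatenating along the batch dimension fails: dim 1 sizes (1 vs 16) differ.
try:
    torch.cat([class_token, patches], dim=0)
    dim0_failed = False
except RuntimeError:
    dim0_failed = True
```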
What is the shape of patch_embeddings after processing a batch of 8 images of size 32x32 with patch size 8 and embedding dimension 64?

    import torch

    patch_size = 8
    embedding_dim = 64
    batch_size = 8
    image_size = 32
    num_patches = (image_size // patch_size) ** 2
    patch_embeddings = torch.randn(batch_size, num_patches, embedding_dim)
Solution
Step 1: Calculate number of patches
Number of patches = (32 / 8)^2 = 4^2 = 16 patches per image.
Step 2: Determine patch_embeddings shape
Shape is (batch_size, num_patches, embedding_dim) = (8, 16, 64).
Final Answer:
(8, 16, 64) -> Option D
Quick Check:
Batch=8, patches=16, embed=64 [OK]
- Mixing embedding dimension and patch count order
- Calculating patches incorrectly
- Confusing batch size with patch count
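The question uses `torch.randn` as a stand-in for real embeddings. As a rough sketch of where that shape comes from in practice, the snippet below projects flattened patches to the embedding dimension with a linear layer. The 3-channel assumption and the use of `nn.Linear` are illustrative choices, not something the question specifies.

```python
import torch
from torch import nn

batch_size, image_size, patch_size, embedding_dim = 8, 32, 8, 64
channels = 3  # assumption: RGB images

num_patches = (image_size // patch_size) ** 2     # (32 // 8) ** 2 = 16
patch_dim = channels * patch_size * patch_size    # 3 * 8 * 8 = 192

# Stand-in for flattened image patches: (batch, num_patches, patch_dim).
flat_patches = torch.randn(batch_size, num_patches, patch_dim)

# A learned linear projection maps each flattened patch to an embedding.
projection = nn.Linear(patch_dim, embedding_dim)
patch_embeddings = projection(flat_patches)
print(patch_embeddings.shape)  # torch.Size([8, 16, 64])
```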
    import torch

    class_token = torch.randn(1, 1, 64)
    patches = torch.randn(8, 16, 64)
    input_seq = torch.cat([class_token, patches], dim=1)

What is the cause of the error?
Solution
Step 1: Check batch size compatibility
class_token has batch size 1, patches have batch size 8; they must match for concatenation.
Step 2: Fix class_token shape
class_token should be repeated or created with shape (8, 1, 64) to match patches batch size.
Final Answer:
class_token shape should be (8, 1, 64) to match batch size -> Option C
Quick Check:
Batch sizes must match for concat [OK]
- Ignoring batch size mismatch
- Changing wrong concat dimension
- Assuming embedding dims cause error
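One common way to apply the fix is to expand the single class token across the batch before concatenating. The sketch below uses `Tensor.expand`, which broadcasts the size-1 batch dimension without copying data; the variable name `class_tokens` is just for illustration.

```python
import torch

class_token = torch.randn(1, 1, 64)   # a single shared class token
patches = torch.randn(8, 16, 64)

batch_size = patches.shape[0]
# Broadcast the size-1 batch dimension to 8; -1 keeps the other dims as they are.
class_tokens = class_token.expand(batch_size, -1, -1)

input_seq = torch.cat([class_tokens, patches], dim=1)
print(input_seq.shape)  # torch.Size([8, 17, 64])
```

Every image in the batch receives the same class token values, which matches the idea of one shared token that is reused for each input.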
Solution
Step 1: Understand class token role
The class token is a special token that attends to all patch tokens and gathers their information.Step 2: Use in classification
After transformer layers, the class token embedding is used as the image's summary representation for classification.
Final Answer:
It aggregates information from all patches via attention to produce a final image representation -> Option A
Quick Check:
Class token = image summary for classification [OK]
- Confusing class token with positional encoding
- Thinking class token applies convolution
- Assuming class token normalizes embeddings
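The role of the class token can be sketched end to end with PyTorch's built-in encoder modules. This is a simplified illustration, not a full ViT: positional embeddings are omitted, the input tokens are random, and `num_classes = 10`, `nhead = 4`, and `num_layers = 2` are arbitrary assumed values.

```python
import torch
from torch import nn

embedding_dim = 64
num_classes = 10  # hypothetical number of output classes

layer = nn.TransformerEncoderLayer(d_model=embedding_dim, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(embedding_dim, num_classes)

# Stand-in input: class token at position 0 followed by 16 patch tokens.
tokens = torch.randn(8, 17, embedding_dim)

encoded = encoder(tokens)      # self-attention mixes information across all tokens
cls_repr = encoded[:, 0]       # keep only the class token: (8, 64)
logits = head(cls_repr)        # one score per class: (8, 10)
print(logits.shape)  # torch.Size([8, 10])
```

Only the class token's output feeds the classification head, so it has to collect what it needs from the patch tokens through attention.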
