Model Pipeline - Vision Transformer (ViT)
The Vision Transformer (ViT) model splits an image into small patches, turns them into a sequence, and uses a transformer to learn patterns for image classification.
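The "split into patches, turn into a sequence" step can be sketched in plain Python on a toy image. This is a minimal illustration, not the tutorial's implementation: the helper name `patchify` and the 4x4 single-channel image are assumptions made for the example. In an actual ViT each flattened patch would then be linearly projected to the embedding dimension before entering the transformer.

```python
def patchify(image, patch_size):
    """Split a square 2D image (a list of rows) into flattened patches.

    Patches are read left to right, top to bottom. Each patch is flattened
    row by row into one vector, so the result is a sequence of patch vectors.
    """
    size = len(image)
    if size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    sequence = []
    for top in range(0, size, patch_size):
        for left in range(0, size, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            sequence.append(patch)
    return sequence


# Hypothetical toy input: a 4x4 single-channel image with pixel values 0..15.
image = [[4 * r + c for c in range(4)] for r in range(4)]
patches = patchify(image, patch_size=2)

print(len(patches))  # 4 patches, each holding 4 pixels
print(patches[0])    # [0, 1, 4, 5]
print(patches[3])    # [10, 11, 14, 15]
```

The same arithmetic gives (32 // 8) ** 2 = 16 patches for a 32x32 image with patch size 8.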
Figure: training loss versus epoch (epochs 1 to 20). The loss falls from 2.3 at epoch 1 to 0.45 at epoch 20.
| Epoch | Loss ↓ | Accuracy ↑ | Observation |
|---|---|---|---|
| 1 | 2.30 | 0.12 | Starting training, loss high, accuracy low |
| 5 | 1.50 | 0.45 | Model learning basic features, accuracy improving |
| 10 | 0.90 | 0.70 | Good progress, model captures complex patterns |
| 15 | 0.60 | 0.82 | Loss decreasing steadily, accuracy high |
| 20 | 0.45 | 0.88 | Training converging, model performs well |
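The epoch-1 row looks like a model that starts near chance. As a rough sanity check, and assuming a 10-class problem trained with cross-entropy loss (neither is stated in the source), a uniform prediction would give a loss of ln 10 and an accuracy of 1/10. The variable names below are illustrative only.

```python
import math

num_classes = 10  # assumption: the source does not state the number of classes

# Cross-entropy of a uniform prediction: -log(1 / num_classes).
chance_loss = -math.log(1.0 / num_classes)
# Expected accuracy when guessing uniformly at random.
chance_accuracy = 1.0 / num_classes

print(round(chance_loss, 2))  # 2.3
print(chance_accuracy)        # 0.1
```

Under these assumptions, the reported 2.30 loss and 0.12 accuracy at epoch 1 sit close to chance level, and the later rows show the model moving well away from it.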
What is the shape of `patch_embeddings` after processing a batch of 8 images of size 32x32 with patch size 8 and embedding dimension 64?

    import torch

    patch_size = 8
    embedding_dim = 64
    batch_size = 8
    image_size = 32
    num_patches = (image_size // patch_size) ** 2  # patches per image
    patch_embeddings = torch.randn(batch_size, num_patches, embedding_dim)
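Prepending a class token to the patch sequence along the sequence dimension:

    import torch

    class_token = torch.randn(1, 1, 64)  # a single class token
    patches = torch.randn(8, 16, 64)     # (batch, num_patches, embedding_dim)
    # torch.cat needs matching sizes on every dimension except dim=1,
    # so the class token is expanded across the batch first.
    input_seq = torch.cat([class_token.expand(patches.size(0), -1, -1), patches], dim=1)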
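The shape arithmetic behind these snippets can be checked without any tensors. The sketch below uses a hypothetical helper, `vit_sequence_shape`, which is not part of the tutorial or of PyTorch. It assumes square images, non-overlapping patches, and one class token per image.

```python
def vit_sequence_shape(batch_size, image_size, patch_size, embedding_dim):
    """Return the (batch, sequence, embedding) shape of the transformer input.

    Assumes square images, non-overlapping patches, and one class token
    prepended to each image's patch sequence.
    """
    num_patches = (image_size // patch_size) ** 2  # patches per image
    seq_len = num_patches + 1                      # plus the class token
    return (batch_size, seq_len, embedding_dim)


shape = vit_sequence_shape(batch_size=8, image_size=32, patch_size=8, embedding_dim=64)
print(shape)  # (8, 17, 64)
```

With the values used above, each image yields 16 patch embeddings, so the sequence length is 17 once each image in the batch has its class token.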