In Vision Transformers, images are split into patches before processing. What is the main reason for this patch embedding step?
Think about how transformers process data compared to convolutional neural networks.
Vision Transformers split images into patches and flatten them to create a sequence of tokens. This sequence format is required because transformers are designed to process sequential data, like words in a sentence.
Given a batch of images with shape (batch_size=8, channels=3, height=32, width=32) and a patch size of 8, what is the shape of the patch embeddings after flattening and linear projection?
```python
import torch
import torch.nn as nn

batch_size = 8
img_size = 32
patch_size = 8
channels = 3

images = torch.randn(batch_size, channels, img_size, img_size)
num_patches = (img_size // patch_size) ** 2      # (32 // 8)^2 = 16
patch_dim = channels * patch_size * patch_size   # 3 * 8 * 8 = 192

# Extract non-overlapping patches along the height and width dimensions
patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.contiguous().view(batch_size, channels, -1, patch_size, patch_size)
# Move the patch axis before channels, then flatten each patch to a vector
patches = patches.permute(0, 2, 1, 3, 4).contiguous().view(batch_size, num_patches, patch_dim)

# Project each flattened patch to a 64-dimensional embedding
linear_proj = nn.Linear(patch_dim, 64)
patch_embeddings = linear_proj(patches)
output_shape = patch_embeddings.shape  # torch.Size([8, 16, 64])
```
Calculate how many patches fit in the image and the embedding dimension after projection.
The image is split into (32/8)^2 = 16 patches. Each patch is flattened to a vector of size 3*8*8 = 192, and the linear layer projects it to 64 dimensions, so the output shape is (8, 16, 64).
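As a quick sanity check, the same shapes can be computed directly from the formulas (a minimal sketch; the variable names are illustrative):

```python
batch_size, img_size, patch_size, channels, embed_dim = 8, 32, 8, 3, 64

# Number of non-overlapping patches per image
num_patches = (img_size // patch_size) ** 2      # 16
# Length of each flattened patch before projection
patch_dim = channels * patch_size * patch_size   # 192

output_shape = (batch_size, num_patches, embed_dim)
print(output_shape)  # (8, 16, 64)
```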
Which type of positional encoding is commonly used in Vision Transformers to help the model understand the order and position of image patches?
Consider whether the positional encoding is fixed or learned in ViT implementations.
Vision Transformers typically use learnable positional embeddings that are added to the patch embeddings. This allows the model to learn the best way to represent position information during training.
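A minimal sketch of this pattern in PyTorch (dimensions are illustrative; the class token used in the original ViT is omitted for brevity):

```python
import torch
import torch.nn as nn

num_patches, embed_dim = 16, 64

# Learnable positional embeddings: one trainable vector per patch position
pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
nn.init.trunc_normal_(pos_embed, std=0.02)  # a common ViT-style initialization

patch_embeddings = torch.randn(8, num_patches, embed_dim)
# Broadcasting adds the same positional information to every image in the batch
x = patch_embeddings + pos_embed
```

Because `pos_embed` is an `nn.Parameter`, it receives gradients and is updated during training just like any other weight.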
What is the most likely effect of increasing the number of transformer encoder layers in a Vision Transformer model?
Think about how deeper models affect learning and computation.
Adding more transformer layers increases the model's ability to learn complex patterns but also requires more computation and can lead to overfitting if not managed properly.
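To make the compute cost concrete, one can compare parameter counts of encoder stacks of different depths (a sketch using PyTorch's built-in `nn.TransformerEncoder`; the dimensions are illustrative, not from the original text):

```python
import torch.nn as nn

def count_params(num_layers, d_model=64, nhead=4):
    # Stack identical encoder layers and count all trainable parameters
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
    return sum(p.numel() for p in encoder.parameters())

shallow = count_params(num_layers=2)
deep = count_params(num_layers=8)
# Parameter count grows linearly with depth
print(deep / shallow)  # 4.0
```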
During training of a Vision Transformer on image classification, the training loss steadily decreases but the validation accuracy plateaus early and does not improve. What is the most likely explanation?
Consider what it means when training loss improves but validation accuracy stops improving.
If training loss decreases but validation accuracy stops improving, the model is likely memorizing training data patterns but failing to generalize, indicating overfitting.
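A common response to this pattern is to stop training once validation performance stops improving. A minimal early-stopping helper (an illustrative sketch, not part of the original text):

```python
class EarlyStopping:
    """Signal a stop after `patience` epochs without validation improvement."""
    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_accuracy):
        # Returns True when training should stop
        if val_accuracy > self.best:
            self.best = val_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=3)
history = [0.60, 0.72, 0.72, 0.71, 0.72]  # validation accuracy plateaus
stops = [stopper.step(acc) for acc in history]
print(stops)  # [False, False, False, False, True]
```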