Computer Vision · ~20 mins

Vision Transformer (ViT) in Computer Vision - Practice Problems & Coding Challenges

Challenge - 5 Problems
🎖️ Vision Transformer Mastery: get all five challenges correct to earn this badge!
🧠 Conceptual · intermediate · 1:30 time limit
What is the main purpose of the patch embedding step in a Vision Transformer?

In Vision Transformers, images are split into patches before processing. What is the main reason for this patch embedding step?

A. To perform data augmentation by randomly cropping image patches
B. To convert the image into a sequence of tokens suitable for transformer input
C. To reduce the image resolution to speed up convolution operations
D. To apply convolutional filters to extract local features
💡 Hint

Think about how transformers process data compared to convolutional neural networks.
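If it helps to see the idea in code, here is a minimal sketch of patch embedding (illustrative only, not the quiz's reference implementation; the patch size of 8 and embedding dimension of 64 are assumptions). A `Conv2d` whose kernel size equals its stride is a common way to turn an image into a sequence of patch tokens:

```python
import torch
import torch.nn as nn

# Assumed toy sizes: 32x32 RGB image, 8x8 patches, 64-dim embeddings.
img = torch.randn(1, 3, 32, 32)   # (batch, channels, H, W)
patch_size, embed_dim = 8, 64

# kernel_size == stride == patch_size: each output position covers one patch.
proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

tokens = proj(img)                          # (1, embed_dim, 4, 4)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, num_patches, embed_dim)
print(tokens.shape)
```

The flatten-and-transpose at the end is what turns a 2D grid of patch features into the 1D token sequence a transformer expects.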

Predict Output · intermediate · 2:00 time limit
Output shape after patch embedding in ViT

Given a batch of images with shape (batch_size=8, height=32, width=32, channels=3), and a patch size of 8, what is the shape of the patch embeddings after flattening and linear projection?

import torch
import torch.nn as nn

batch_size = 8
img_size = 32
patch_size = 8
channels = 3

images = torch.randn(batch_size, channels, img_size, img_size)

num_patches = (img_size // patch_size) ** 2
patch_dim = channels * patch_size * patch_size

# Extract non-overlapping patches, then flatten each patch into a vector
patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.contiguous().view(batch_size, channels, -1, patch_size, patch_size)
patches = patches.permute(0, 2, 1, 3, 4).contiguous().view(batch_size, num_patches, patch_dim)

linear_proj = nn.Linear(patch_dim, 64)
patch_embeddings = linear_proj(patches)

output_shape = patch_embeddings.shape
A. (8, 64, 16)
B. (8, 16, 192)
C. (8, 16, 64)
D. (8, 32, 64)
💡 Hint

Calculate how many patches fit in the image and the embedding dimension after projection.

Model Choice · advanced · 1:30 time limit
Choosing the correct positional encoding for ViT

Which type of positional encoding is commonly used in Vision Transformers to help the model understand the order and position of image patches?

A. Learnable positional embeddings added to patch embeddings
B. Fixed sinusoidal positional encodings like in original NLP transformers
C. No positional encoding is used in ViT
D. Positional encoding applied via convolutional layers
💡 Hint

Consider whether the positional encoding is fixed or learned in ViT implementations.
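A quick sketch of what a learnable positional embedding looks like in PyTorch (the sizes are assumed toy values; real ViT implementations differ in detail, but the pattern of an `nn.Parameter` added to the token sequence is the key point):

```python
import torch
import torch.nn as nn

num_patches, embed_dim, batch = 16, 64, 8  # assumed toy sizes

# Learnable positional embeddings: one row per token (+1 for a class token),
# stored as an nn.Parameter so the optimizer updates them like any weight.
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

patch_tokens = torch.randn(batch, num_patches, embed_dim)
x = torch.cat([cls_token.expand(batch, -1, -1), patch_tokens], dim=1)
x = x + pos_embed  # broadcasts over the batch dimension
print(x.shape)
```

Because `pos_embed` is a parameter rather than a fixed table, the model learns its own notion of patch position during training.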

Hyperparameter · advanced · 1:30 time limit
Effect of increasing the number of transformer layers in ViT

What is the most likely effect of increasing the number of transformer encoder layers in a Vision Transformer model?

A. Improves model capacity and may increase accuracy but also increases training time and risk of overfitting
B. Decreases model capacity and reduces accuracy due to vanishing gradients
C. Has no effect on model performance or training time
D. Reduces the number of patches processed by the model
💡 Hint

Think about how deeper models affect learning and computation.
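One concrete way to see the capacity/compute trade-off: parameter count grows linearly with encoder depth. A small sketch using PyTorch's stock encoder (the layer sizes here are assumed toy values, not ViT's real configuration):

```python
import torch.nn as nn

def count_params(depth: int) -> int:
    """Total parameters in a transformer encoder stack of the given depth."""
    layer = nn.TransformerEncoderLayer(
        d_model=64, nhead=4, dim_feedforward=128, batch_first=True
    )
    # nn.TransformerEncoder deep-copies the layer `depth` times.
    encoder = nn.TransformerEncoder(layer, num_layers=depth)
    return sum(p.numel() for p in encoder.parameters())

# Doubling the depth doubles the parameter count (and roughly the FLOPs).
print(count_params(4), count_params(8))
```

More parameters mean more representational capacity, but also more computation per step and a larger risk of memorizing the training set.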

Metrics · expert · 2:00 time limit
Interpreting ViT training loss and accuracy curves

During training of a Vision Transformer on image classification, the training loss steadily decreases but the validation accuracy plateaus early and does not improve. What is the most likely explanation?

A. The batch size is too large causing poor gradient estimates
B. The model is underfitting and needs more training epochs
C. The learning rate is too high causing unstable training
D. The model is overfitting the training data and not generalizing well to validation data
💡 Hint

Consider what it means when training loss improves but validation accuracy stops improving.
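A common practical response to this curve shape is early stopping on the validation metric. The sketch below uses made-up per-epoch numbers (purely illustrative, not from any real run) to show the pattern the question describes and a minimal plateau detector:

```python
# Hypothetical metrics: training loss keeps falling while validation
# accuracy stalls, i.e. the classic overfitting signature.
train_loss = [2.1, 1.4, 0.9, 0.5, 0.3, 0.15]
val_acc    = [0.42, 0.55, 0.61, 0.62, 0.61, 0.62]

best, patience, wait = 0.0, 2, 0
for epoch, acc in enumerate(val_acc):
    if acc > best + 1e-3:          # meaningful improvement resets the clock
        best, wait = acc, 0
    else:
        wait += 1
    if wait >= patience:           # no improvement for `patience` epochs
        print(f"early stop at epoch {epoch}: validation accuracy plateaued")
        break
```

Stopping (or adding regularization such as dropout, weight decay, or augmentation) when the validation metric flatlines is how this diagnosis is usually acted on in practice.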