In Vision Transformers, images are split into patches before processing. What is the main reason for this patch embedding step?
Think about how transformers process data compared to convolutional neural networks.
Vision Transformers split images into patches and flatten them to create a sequence of tokens. This sequence format is required because transformers are designed to process sequential data, like words in a sentence.
Given a batch of images with shape (batch_size=8, channels=3, height=32, width=32) and a patch size of 8, what is the shape of the patch embeddings after flattening and linear projection?
```python
import torch
import torch.nn as nn

batch_size = 8
img_size = 32
patch_size = 8
channels = 3

images = torch.randn(batch_size, channels, img_size, img_size)
num_patches = (img_size // patch_size) ** 2      # (32 // 8)^2 = 16
patch_dim = channels * patch_size * patch_size   # 3 * 8 * 8 = 192

# Extract non-overlapping patches along the height and width dimensions
patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.contiguous().view(batch_size, channels, -1, patch_size, patch_size)
# Move the patch axis before channels, then flatten each patch to a vector
patches = patches.permute(0, 2, 1, 3, 4).contiguous().view(batch_size, num_patches, patch_dim)

# Project each flattened patch to a 64-dimensional embedding
linear_proj = nn.Linear(patch_dim, 64)
patch_embeddings = linear_proj(patches)
output_shape = patch_embeddings.shape  # torch.Size([8, 16, 64])
```
Calculate how many patches fit in the image and the embedding dimension after projection.
The image is split into (32/8)^2 = 16 patches. Each patch is flattened to a vector of size 3*8*8 = 192, and the linear layer projects it to 64 dimensions, so the output shape is (8, 16, 64).
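As a quick sanity check, the same shapes can be computed directly from the formulas (a minimal sketch; the variable names are illustrative):

```python
batch_size, img_size, patch_size, channels, embed_dim = 8, 32, 8, 3, 64

# Number of non-overlapping patches per image
num_patches = (img_size // patch_size) ** 2      # 16
# Length of each flattened patch before projection
patch_dim = channels * patch_size * patch_size   # 192

output_shape = (batch_size, num_patches, embed_dim)
print(output_shape)  # (8, 16, 64)
```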
Which type of positional encoding is commonly used in Vision Transformers to help the model understand the order and position of image patches?
Consider whether the positional encoding is fixed or learned in ViT implementations.
Vision Transformers typically use learnable positional embeddings that are added to the patch embeddings. This allows the model to learn the best way to represent position information during training.
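A minimal sketch of this pattern in PyTorch (dimensions are illustrative; the class token used in the original ViT is omitted for brevity):

```python
import torch
import torch.nn as nn

num_patches, embed_dim = 16, 64

# Learnable positional embeddings: one trainable vector per patch position
pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
nn.init.trunc_normal_(pos_embed, std=0.02)  # a common ViT-style initialization

patch_embeddings = torch.randn(8, num_patches, embed_dim)
# Broadcasting adds the same positional information to every image in the batch
x = patch_embeddings + pos_embed
```

Because `pos_embed` is an `nn.Parameter`, it receives gradients and is updated during training just like any other weight.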
What is the most likely effect of increasing the number of transformer encoder layers in a Vision Transformer model?
Think about how deeper models affect learning and computation.
Adding more transformer layers increases the model's ability to learn complex patterns but also requires more computation and can lead to overfitting if not managed properly.
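To make the compute cost concrete, one can compare parameter counts of encoder stacks of different depths (a sketch using PyTorch's built-in `nn.TransformerEncoder`; the dimensions are illustrative, not from the original text):

```python
import torch.nn as nn

def count_params(num_layers, d_model=64, nhead=4):
    # Stack identical encoder layers and count all trainable parameters
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
    return sum(p.numel() for p in encoder.parameters())

shallow = count_params(num_layers=2)
deep = count_params(num_layers=8)
# Parameter count grows linearly with depth
print(deep / shallow)  # 4.0
```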
During training of a Vision Transformer on image classification, the training loss steadily decreases but the validation accuracy plateaus early and does not improve. What is the most likely explanation?
Consider what it means when training loss improves but validation accuracy stops improving.
If training loss decreases but validation accuracy stops improving, the model is likely memorizing training data patterns but failing to generalize, indicating overfitting.
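A common response to this pattern is to stop training once validation performance stops improving. A minimal early-stopping helper (an illustrative sketch, not part of the original text):

```python
class EarlyStopping:
    """Signal a stop after `patience` epochs without validation improvement."""
    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_accuracy):
        # Returns True when training should stop
        if val_accuracy > self.best:
            self.best = val_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=3)
history = [0.60, 0.72, 0.72, 0.71, 0.72]  # validation accuracy plateaus
stops = [stopper.step(acc) for acc in history]
print(stops)  # [False, False, False, False, True]
```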