Practice


1. What is the main purpose of splitting an image into patches in a Vision Transformer (ViT)?

easy

A. To reduce the image size by cropping

B. To convert the image into smaller parts that the transformer can process as tokens

C. To apply convolution filters on each patch separately

D. To increase the image resolution for better detail

Solution

Step 1: Understand ViT input processing
ViT splits each image into fixed-size, non-overlapping patches and treats each patch like a word token in a language model.
Step 2: Purpose of patch splitting
Each patch is flattened and linearly projected into an embedding, so the transformer can process the image as a sequence of tokens and use self-attention to learn relationships between patches.
Final Answer:
To convert the image into smaller parts that the transformer can process as tokens -> Option B
Quick Check:
Image patches = tokens for transformer [OK]

Hint: Think of patches as words in a sentence for the transformer

Common Mistakes:

Confusing patch splitting with image resizing
Thinking patches are processed by convolution
Assuming patches increase image resolution
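
The patch-splitting idea from Question 1 can be sketched as follows. This is a minimal illustration, assuming PyTorch, 3-channel images, and a hypothetical helper named patchify that is not part of the original material. It only reshapes the image and applies a linear projection; no convolution, cropping, or resizing is involved.

```python
import torch
from torch import nn


def patchify(images: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Split (B, C, H, W) images into flattened, non-overlapping patches.

    Returns a tensor of shape (B, N, C * P * P), where N is the number of
    patches per image and P is the patch size. Hypothetical helper.
    """
    b, c, h, w = images.shape
    if h % patch_size or w % patch_size:
        raise ValueError("height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (B, C, gh, P, gw, P): cut each spatial axis into a grid of patches
    x = images.reshape(b, c, gh, patch_size, gw, patch_size)
    # (B, gh, gw, C, P, P): bring the grid axes next to the batch axis
    x = x.permute(0, 2, 4, 1, 3, 5)
    # (B, N, C * P * P): one flattened vector per patch
    return x.reshape(b, gh * gw, c * patch_size * patch_size)


images = torch.randn(8, 3, 32, 32)
patches = patchify(images, patch_size=8)
print(patches.shape)  # torch.Size([8, 16, 192])

# A linear layer turns each flattened patch into a token embedding.
projection = nn.Linear(3 * 8 * 8, 64)
tokens = projection(patches)
print(tokens.shape)  # torch.Size([8, 16, 64])
```

Each of the 16 rows along dimension 1 plays the role of one "word" in the sequence the transformer sees.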

2. Which of the following correctly prepends a class token to the patch embeddings in ViT, using PyTorch and tensors of shape (batch, seq, embed)?

easy

A. patches = torch.cat([class_token, patches], dim=1)

B. patches = torch.cat([patches, class_token], dim=1)

C. patches = torch.cat([patches, class_token], dim=0)

D. patches = torch.cat([class_token, patches], dim=0)

Solution

Step 1: Understand tensor concatenation dimension
Patch embeddings have shape (batch, seq, embed), so the sequence runs along dimension 1; the class token must be prepended along this dimension.
Step 2: Correct concatenation syntax
Calling torch.cat with dim=1 and class_token listed first places the class token at the start of the sequence.
Final Answer:
patches = torch.cat([class_token, patches], dim=1) -> Option A
Quick Check:
Class token listed first, concatenated along the sequence dimension (dim=1) [OK]

Hint: Class token goes first, concat along sequence dimension (dim=1)

Common Mistakes:

Concatenating along wrong dimension (dim=0)
Appending class token at the end instead of start
Mixing order of tensors in concat
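
The difference between the options in Question 2 can be checked directly. The sketch below assumes PyTorch and uses small, made-up shapes (batch 2, 3 patches, embedding size 4) with constant values so the class token is easy to spot; none of these numbers come from the original question.

```python
import torch

batch, num_patches, dim = 2, 3, 4          # illustrative sizes only
patches = torch.zeros(batch, num_patches, dim)
class_token = torch.ones(batch, 1, dim)    # already has the batch dimension

# Option A: class token first, concatenated along the sequence dimension.
seq = torch.cat([class_token, patches], dim=1)
print(seq.shape)  # torch.Size([2, 4, 4])

# Position 0 of every sequence now holds the class token (all ones).
cls_first = bool((seq[:, 0] == 1).all())

# Concatenating along dim=0 fails here, because the sequence lengths
# (1 versus 3) differ and only the concatenation dimension may differ.
try:
    torch.cat([class_token, patches], dim=0)
    dim0_failed = False
except RuntimeError:
    dim0_failed = True

print(cls_first, dim0_failed)  # True True
```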

3. Given the following simplified ViT patch embedding code, what is the shape of patch_embeddings after processing a batch of 8 images of size 32x32 with patch size 8 and embedding dimension 64?

import torch

patch_size = 8
embedding_dim = 64
batch_size = 8
image_size = 32
num_patches = (image_size // patch_size) ** 2
patch_embeddings = torch.randn(batch_size, num_patches, embedding_dim)

medium

A. (16, 8, 64)

B. (8, 64, 16)

C. (8, 8, 64)

D. (8, 16, 64)

Solution

Step 1: Calculate number of patches
Number of patches = (32 / 8)^2 = 4^2 = 16 patches per image.
Step 2: Determine patch_embeddings shape
Shape is (batch_size, num_patches, embedding_dim) = (8, 16, 64).
Final Answer:
(8, 16, 64) -> Option D
Quick Check:
Batch=8, patches=16, embed=64 [OK]

Hint: Calculate patches as (image/patch)^2, then batch x patches x embed

Common Mistakes:

Mixing embedding dimension and patch count order
Calculating patches incorrectly
Confusing batch size with patch count
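
The arithmetic in Question 3 generalizes to other image and patch sizes. The following is a small sketch in plain Python; the function name patch_embedding_shape is hypothetical, and it assumes square images and square patches as in the question.

```python
def patch_embedding_shape(batch_size, image_size, patch_size, embedding_dim):
    """Return (batch, num_patches, embed) for square images and patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return (batch_size, num_patches, embedding_dim)


# The setup from the question: 32x32 images, patch size 8, embedding 64.
print(patch_embedding_shape(8, 32, 8, 64))      # (8, 16, 64)

# A larger, purely illustrative configuration.
print(patch_embedding_shape(1, 224, 16, 768))   # (1, 196, 768)
```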

4. You have this ViT code snippet that throws an error:

import torch

class_token = torch.randn(1, 1, 64)
patches = torch.randn(8, 16, 64)
input_seq = torch.cat([class_token, patches], dim=1)

What is the cause of the error?

medium

A. Embedding dimensions do not match

B. Wrong concatenation dimension; should be dim=0

C. class_token shape should be (8, 1, 64) to match batch size

D. Dimension mismatch because class_token sequence size is 1 but patches sequence size is 16

Solution

Step 1: Check batch size compatibility
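torch.cat requires every dimension except the concatenation dimension to match. class_token has batch size 1 and patches has batch size 8, so concatenating along dim=1 fails.
Step 2: Fix class_token shape
class_token should be expanded or repeated along the batch dimension (for example, class_token.expand(8, -1, -1)) or created with shape (8, 1, 64) to match the batch size of patches.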
Final Answer:
class_token shape should be (8, 1, 64) to match batch size -> Option C
Quick Check:
Batch sizes must match for concat [OK]

Hint: Match batch sizes before concatenating tensors

Common Mistakes:

Ignoring batch size mismatch
Changing wrong concat dimension
Assuming embedding dims cause error
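
The error from Question 4 and one possible fix can be reproduced as below. This is a sketch assuming PyTorch; using expand is one choice among several (repeat, or creating the token with the full batch size, would also work), and the variable names batch_tokens and raised are introduced here for illustration.

```python
import torch

class_token = torch.randn(1, 1, 64)
patches = torch.randn(8, 16, 64)

# The original call fails: batch sizes 1 and 8 differ, and torch.cat
# only allows a size difference along the concatenation dimension.
try:
    torch.cat([class_token, patches], dim=1)
    raised = False
except RuntimeError:
    raised = True
print(raised)  # True

# Fix: expand the single class token across the batch dimension.
# expand returns a view, so no extra memory is allocated for the copies.
batch_tokens = class_token.expand(patches.shape[0], -1, -1)  # (8, 1, 64)
input_seq = torch.cat([batch_tokens, patches], dim=1)
print(input_seq.shape)  # torch.Size([8, 17, 64])
```

The sequence length becomes 17: one class token plus 16 patch tokens per image.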

5. In a Vision Transformer model, why is the class token important for image classification tasks?

hard

A. It aggregates information from all patches via attention to produce a final image representation

B. It stores the positional information of patches

C. It applies convolution to patches before transformer layers

D. It normalizes the patch embeddings before feeding to the transformer

Solution

Step 1: Understand class token role
The class token is a special learnable token prepended to the patch sequence; through self-attention it attends to all patch tokens and gathers their information.
Step 2: Use in classification
After transformer layers, the class token embedding is used as the image's summary representation for classification.
Final Answer:
It aggregates information from all patches via attention to produce a final image representation -> Option A
Quick Check:
Class token = image summary for classification [OK]

Hint: Class token collects info from patches for final decision

Common Mistakes:

Confusing class token with positional encoding
Thinking class token applies convolution
Assuming class token normalizes embeddings
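
How the class token feeds the classifier, as described in Question 5, can be sketched like this. The example assumes PyTorch and uses its built-in nn.TransformerEncoder as a stand-in for the ViT encoder; the layer sizes, the number of classes (10), and the random input are illustrative assumptions, and positional embeddings are omitted to keep the sketch short.

```python
import torch
from torch import nn

torch.manual_seed(0)

embed_dim, num_classes = 64, 10  # illustrative values

# Stand-in encoder: two standard transformer encoder layers.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=embed_dim, nhead=4, dim_feedforward=128, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
head = nn.Linear(embed_dim, num_classes)

# 8 sequences of 17 tokens: position 0 is the class token,
# positions 1..16 are the patch embeddings.
tokens = torch.randn(8, 17, embed_dim)

encoded = encoder(tokens)     # (8, 17, 64): every token attends to all others
cls_out = encoded[:, 0]       # (8, 64): the class token's final embedding
logits = head(cls_out)        # (8, 10): one score per class

print(cls_out.shape)  # torch.Size([8, 64])
print(logits.shape)   # torch.Size([8, 10])
```

Only the class token's output is passed to the classification head; the patch outputs are discarded in this setup.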

Why Vision Transformer (ViT) in Computer Vision? - Purpose & Use Cases
