Vision Transformer (ViT) in Computer Vision - Interactive Code Practice

Practice - 5 Tasks

Answer the questions below

1. Fill in the blank

easy

Complete the code to import the Vision Transformer model from the torchvision library.

from torchvision.models import [1]

A. alexnet

B. resnet50

C. vgg16

D. vit_b_16

2. Fill in the blank

medium

Complete the code to create a Vision Transformer model pretrained on ImageNet.

model = [1](pretrained=True)

A. densenet121

B. resnet50

C. vit_b_16

D. mobilenet_v2

3. Fill in the blank

hard

Complete the code so that it correctly splits the input image tensor into patches for ViT patch embedding. The same value fills every blank.

patches = x.unfold(2, [1], [1]).unfold(3, [1], [1])

B. 16

C. 32

D. 64
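For intuition, `Tensor.unfold(dimension, size, step)` returns sliding windows of length `size` taken every `step` elements along one dimension; when `size` equals `step`, the windows do not overlap. The sketch below mimics that idea in plain Python on a toy 4x4 grid with a patch size of 2. The helper name `split_into_patches` is hypothetical, and the code only illustrates the splitting logic; it is not how PyTorch implements `unfold`.

```python
def split_into_patches(image, patch_size):
    """Split a 2-D grid (a list of rows) into non-overlapping square patches.

    Patches are returned in row-major order, each as a list of rows.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    # The stride equals the window size, so the windows never overlap.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


# Toy 4x4 single-channel "image" holding the values 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, patch_size=2)
print(len(patches))  # 4
print(patches[0])    # [[0, 1], [4, 5]]
```

Each patch would then be flattened and projected to an embedding vector, giving one token per patch.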

4. Fill in the blank

hard

Fill both blanks to complete the code that applies the multi-head self-attention mechanism in ViT.

attention_output = self.attn(query, key, value, [1]=mask, [2]=True)

A. attn_mask

B. key_padding_mask

C. batch_first

D. dropout
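To see what an attention mask does, here is a minimal single-head scaled dot-product attention sketch in plain Python. It assumes a boolean mask in which `True` means "this query may not attend to this key", which matches the boolean `attn_mask` convention of `torch.nn.MultiheadAttention`. The function name `masked_attention` is hypothetical, and the sketch omits the learned projections and multiple heads of a real layer. Note also that in PyTorch `batch_first` is set when the `torch.nn.MultiheadAttention` layer is constructed, not passed to the forward call.

```python
import math


def masked_attention(queries, keys, values, mask=None):
    """Scaled dot-product attention for one head over lists of vectors.

    mask[i][j] == True means query i may NOT attend to key j.
    Assumes every query can attend to at least one key.
    """
    dim = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        if mask is not None:
            # Blocked positions get -inf so their softmax weight is 0.
            scores = [float("-inf") if mask[i][j] else s
                      for j, s in enumerate(scores)]
        # Numerically stable softmax over the scores.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0]]
# Token 0 is blocked from attending to token 1; token 1 sees both.
mask = [[False, True], [False, False]]
out = masked_attention(tokens, tokens, tokens, mask)
print(out[0])  # [1.0, 0.0]
```

Because token 0 can only attend to itself, its output equals its own value vector, while token 1 receives a mix of both tokens.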

5. Fill in the blank

hard

Fill all three blanks to complete the code that computes the classification output from the ViT model.

cls_token = x[:, [1]].unsqueeze(1)
output = self.mlp_head(cls_token).squeeze([2])
loss = criterion(output, [3])

C. labels

Practice

(1/5)

1. What is the main purpose of splitting an image into patches in a Vision Transformer (ViT)?

easy

A. To reduce the image size by cropping

B. To convert the image into smaller parts that the transformer can process as tokens

C. To apply convolution filters on each patch separately

D. To increase the image resolution for better detail
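The arithmetic behind patch tokens can be sketched in a few lines. The example below assumes a square RGB image, square non-overlapping patches, and one extra class token prepended to the sequence; the helper name `vit_sequence_shape` and the 224/32 sizes are illustrative choices, not values taken from this page.

```python
def vit_sequence_shape(image_size, patch_size, channels=3, use_class_token=True):
    """Return (num_patches, flattened_patch_dim, sequence_length)."""
    if image_size % patch_size:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    # Each patch is flattened to channels * patch_size * patch_size values
    # before the linear projection to the embedding dimension.
    patch_dim = channels * patch_size ** 2
    # The optional class token adds one position to the sequence.
    seq_len = num_patches + (1 if use_class_token else 0)
    return num_patches, patch_dim, seq_len


print(vit_sequence_shape(224, 32))  # (49, 3072, 50)
```

So a 224x224 image cut into 32x32 patches becomes a sequence of 49 patch tokens, or 50 positions once the class token is included.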

Solution

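Step 1: Understand ViT input processing

A transformer operates on a sequence of token vectors, not on a 2-D pixel grid, so the image must first be turned into a sequence.

Step 2: Purpose of patch splitting

Each fixed-size patch is flattened and linearly projected into an embedding, which the transformer treats as one token.

Final Answer:

B. To convert the image into smaller parts that the transformer can process as tokens

Quick Check:

Patches play the role of tokens in the transformer's input sequence.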

Solution

Step 1: Understand tensor concatenation dimension

Step 2: Correct concatenation syntax

Final Answer:

Quick Check:

Solution

Step 1: Calculate number of patches

Step 2: Determine patch_embeddings shape

Final Answer:

Quick Check:

Solution

Step 1: Check batch size compatibility

Step 2: Fix class_token shape

Final Answer:

Quick Check:

Solution

Step 1: Understand class token role

Step 2: Use in classification

Final Answer:

Quick Check: