Complete the code to import the Vision Transformer model from the torchvision library.
from torchvision.models import [1]
The base Vision Transformer in torchvision is vit_b_16 (ViT-Base with 16x16 patches); the other options are CNN architectures, not transformers.
Complete the code to create a Vision Transformer model pretrained on ImageNet.
model = [1](pretrained=True)
To create a pretrained Vision Transformer, use vit_b_16(pretrained=True). Note that the pretrained flag is deprecated in torchvision >= 0.13, where the equivalent is vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1). Other options are different model architectures.
Fix the error in the code to correctly reshape the input image tensor for ViT patch embedding.
patches = x.unfold(2, [1], [1]).unfold(3, [1], [1])
ViT-B/16 uses 16x16 patches, so both the unfold size and the step must be 16. For a 224x224 input this produces a 14x14 grid of non-overlapping patches.
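The corrected call can be checked on a dummy batch; this is a sketch assuming the standard 224x224 ViT input size:

```python
import torch

# Split an image batch into non-overlapping 16x16 patches, as ViT-B/16
# does before patch embedding. Both unfold arguments are (size=16, step=16).
x = torch.randn(1, 3, 224, 224)                 # (batch, channels, H, W)
patches = x.unfold(2, 16, 16).unfold(3, 16, 16)

# Result: (1, 3, 14, 14, 16, 16) -- a 14x14 grid of 16x16 patches.
print(patches.shape)
```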
Fill both blanks to complete the code that applies the multi-head self-attention mechanism in ViT.
attention_output = self.attn(query, key, value, [1]=mask, [2]=True)
The attn_mask parameter applies the attention mask, and need_weights=True makes the call also return the attention weights. Common mistakes: passing the mask as key_padding_mask (which masks whole tokens rather than attention pairs), and passing batch_first=True in the call — batch_first is a constructor argument of nn.MultiheadAttention, not a forward argument, and supplying it here causes an error.
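A sketch of the completed call using nn.MultiheadAttention, with hypothetical small sizes (embed_dim=64, 4 heads) standing in for ViT-B's 768/12:

```python
import torch
import torch.nn as nn

# batch_first is set here, in the constructor -- not in the forward call.
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

x = torch.randn(2, 197, 64)                     # (batch, 1 + 14*14 tokens, dim)
mask = torch.zeros(197, 197, dtype=torch.bool)  # False everywhere = no masking

# attn_mask applies the mask; need_weights=True also returns attention weights.
out, weights = attn(x, x, x, attn_mask=mask, need_weights=True)
print(out.shape, weights.shape)  # (2, 197, 64) and (2, 197, 197)
```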
Fill all three blanks to complete the code that computes the classification output from the ViT model.
cls_token = x[:, [1]].unsqueeze(1)
output = self.mlp_head(cls_token).squeeze([2])
loss = criterion(output, [3])
The classification ([CLS]) token sits at index 0; after the MLP head, squeeze(1) removes the singleton token dimension to leave (batch, num_classes) logits; and the loss is computed against the true labels.
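A self-contained sketch of the filled-in answer, with hypothetical sizes (dim=64, 10 classes) standing in for ViT-B's 768/1000:

```python
import torch
import torch.nn as nn

mlp_head = nn.Linear(64, 10)      # classification head (hypothetical sizes)
criterion = nn.CrossEntropyLoss()

x = torch.randn(2, 197, 64)       # encoder output: (batch, tokens, dim)
labels = torch.tensor([3, 7])     # true class labels

cls_token = x[:, 0].unsqueeze(1)          # (2, 1, 64): the [CLS] token
output = mlp_head(cls_token).squeeze(1)   # (2, 10): class logits
loss = criterion(output, labels)
print(output.shape, loss.item())
```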