Recall & Review
beginner
What is the main idea behind the Vision Transformer (ViT)?
ViT splits an image into small patches and treats them like words in a sentence, then uses a Transformer model to learn from these patches for image recognition.
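The patch-splitting step described above can be sketched in plain Python. This is a minimal illustration, assuming a grayscale image stored as a nested list; real ViT implementations use tensor operations (and often a strided convolution) to do the same thing:

```python
def patchify(image, patch_size):
    """Split an H x W image (nested list of pixel values) into
    non-overlapping patch_size x patch_size patches, flattening
    each patch into a 1-D token vector."""
    h = len(image)
    tokens = []
    for top in range(0, h, patch_size):
        for left in range(0, len(image[0]), patch_size):
            patch = [image[top + i][left + j]
                     for i in range(patch_size)
                     for j in range(patch_size)]
            tokens.append(patch)
    return tokens

# A 4x4 "image" split into 2x2 patches yields 4 tokens of length 4,
# analogous to a 4-word sentence fed to a Transformer.
image = [[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11],
         [12, 13, 14, 15]]
tokens = patchify(image, 2)
```

Each token is then linearly projected to the model dimension before entering the Transformer encoder.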
intermediate
How does ViT process images differently from traditional convolutional neural networks (CNNs)?
ViT does not rely on convolutional layers; instead, it divides images into patches and applies self-attention mechanisms to capture relationships between patches.
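The self-attention mechanism mentioned above can be sketched as scaled dot-product attention over patch tokens. This is a simplified illustration: the tokens themselves serve as queries, keys, and values, whereas a real ViT first applies learned linear projections and uses multiple heads:

```python
import math

def self_attention(tokens):
    """Scaled dot-product self-attention over a list of token vectors.
    Every token attends to every other token, so relationships between
    distant patches are captured in a single layer."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Similarity of this query against every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        # Softmax turns scores into attention weights that sum to 1.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output is the attention-weighted sum of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, tokens))
                    for i in range(d)])
    return out

attended = self_attention([[1.0, 0.0], [0.0, 1.0]])
```

Unlike a convolution, whose receptive field is local, every output token here mixes information from all patches at once.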
intermediate
What role do positional embeddings play in Vision Transformer models?
Positional embeddings add information about the location of each image patch, helping the model understand the order and position of patches since Transformers do not have built-in spatial awareness.
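Adding positional embeddings is a simple element-wise addition of a position-specific vector to each patch token. In a real ViT the embeddings are learned parameters; the fixed values below are only for illustration:

```python
def add_positional_embeddings(tokens, pos_embeddings):
    """Add a position-specific vector to each patch token so the
    Transformer can distinguish patches by their location."""
    return [[t + p for t, p in zip(token, pos)]
            for token, pos in zip(tokens, pos_embeddings)]

tokens = [[0.5, 0.5], [0.5, 0.5]]   # two identical patch tokens
pos    = [[0.0, 0.1], [0.2, 0.3]]   # one embedding per position
embedded = add_positional_embeddings(tokens, pos)
```

Note that the two input patches were identical; after the addition they differ, which is exactly what lets attention use spatial position.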
advanced
Why is pretraining on large datasets important for Vision Transformers?
ViTs need large amounts of data to learn good representations because they lack the built-in inductive biases of CNNs, so pretraining on big datasets helps them perform well on smaller tasks.
beginner
What metric is commonly used to evaluate the performance of a Vision Transformer on image classification tasks?
Accuracy is commonly used, which measures the percentage of images correctly classified by the model.
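The accuracy metric described above is just the fraction of predictions that match the ground-truth labels. A minimal sketch, with made-up class labels for illustration:

```python
def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# 3 of 4 predictions are correct, so accuracy is 0.75.
acc = accuracy(["cat", "dog", "cat", "bird"],
               ["cat", "dog", "dog", "bird"])
```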
What does a Vision Transformer (ViT) use to represent an image before processing?
ViT splits the image into small patches and treats each patch as a token for the Transformer.
Why are positional embeddings necessary in ViT models?
Transformers do not know the order or position of tokens, so positional embeddings tell the model where each patch is located.
Which of the following is NOT a characteristic of Vision Transformers?
ViTs do not use convolutional layers; they rely on self-attention mechanisms.
What is a common evaluation metric for ViT on classification tasks?
Accuracy measures how many images are correctly classified, which is standard for classification.
How does ViT handle the spatial structure of images?
ViT uses positional embeddings to keep track of where each patch is located in the image.
Explain how Vision Transformer (ViT) processes an image from input to output.
Think about how ViT treats image patches like words in a sentence.
Describe the advantages and challenges of using Vision Transformers compared to CNNs.
Consider data needs and model design differences.