Computer Vision · ~5 mins

Vision Transformer (ViT) in Computer Vision - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is the main idea behind the Vision Transformer (ViT)?
ViT splits an image into small patches and treats them like words in a sentence, then uses a Transformer model to learn from these patches for image recognition.
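The patch-splitting step can be sketched with numpy. This is a minimal illustration, not a real ViT component: the image is random stand-in data, and the patch size and image size are made-up values chosen so they divide evenly.

```python
import numpy as np

# Toy image: 32x32 RGB (random values stand in for pixel data).
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))

patch = 8  # patch size; 32/8 = 4 patches per side -> 16 patches total

# Split into non-overlapping 8x8 patches and flatten each one,
# producing a sequence of "tokens" of length patch*patch*channels = 192.
h, w, c = image.shape
patches = image.reshape(h // patch, patch, w // patch, patch, c)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

print(patches.shape)  # (16, 192)
```

In a real ViT, each flattened patch is then mapped by a learned linear projection to the model's embedding dimension.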
intermediate
How does ViT process images differently from traditional convolutional neural networks (CNNs)?
ViT does not use convolution layers; instead, it divides images into patches and applies self-attention mechanisms to capture relationships between patches.
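The self-attention mechanism mentioned above can be sketched for a sequence of patch tokens. This is a single-head toy version with random weights (the token count and embedding size are illustrative, not from any real model): every patch attends to every other patch, so the relationships captured are global rather than local as in a convolution.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.random((16, 64))  # 16 patch tokens, 64-dim embeddings (toy sizes)

# Scaled dot-product self-attention: each patch queries all patches.
d = tokens.shape[1]
Wq, Wk, Wv = (rng.random((d, d)) for _ in range(3))
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv

scores = Q @ K.T / np.sqrt(d)                    # (16, 16) patch-to-patch scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over patches
out = weights @ V                                 # (16, 64) attended tokens

print(out.shape)  # (16, 64)
```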
intermediate
What role do positional embeddings play in Vision Transformer models?
Positional embeddings add information about the location of each image patch, helping the model understand the order and position of patches since Transformers do not have built-in spatial awareness.
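The addition step is simple to show. In ViT the positional embeddings are learned parameters; in this sketch random vectors stand in for them, and the sizes are illustrative. One embedding per patch position is added element-wise to the patch embeddings before the Transformer layers.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, d = 16, 64              # toy sizes
patch_embeddings = rng.random((num_patches, d))

# One positional vector per patch position, added so the (otherwise
# permutation-invariant) Transformer can tell patch locations apart.
pos_embeddings = rng.random((num_patches, d))
tokens = patch_embeddings + pos_embeddings

print(tokens.shape)  # (16, 64)
```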
advanced
Why is pretraining on large datasets important for Vision Transformers?
ViTs need large amounts of data to learn good representations because they lack the built-in inductive biases of CNNs, so pretraining on big datasets helps them perform well on smaller tasks.
beginner
What metric is commonly used to evaluate the performance of a Vision Transformer on image classification tasks?
Accuracy is commonly used, which measures the percentage of images correctly classified by the model.
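Computing accuracy is a one-liner. The labels below are made-up values for illustration: 10 images, 3 classes, with two predictions wrong.

```python
import numpy as np

# Hypothetical ground-truth vs. predicted class labels for 10 images.
y_true = np.array([0, 1, 2, 2, 1, 0, 0, 1, 2, 2])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 1, 2, 2])

accuracy = (y_pred == y_true).mean()  # fraction correctly classified
print(accuracy)  # 0.8
```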
What does a Vision Transformer (ViT) use to represent an image before processing?
A. Small patches of the image treated like tokens
B. Pixels directly as input vectors
C. Convolutional filters applied to the whole image
D. Histogram of gradients
Why are positional embeddings necessary in ViT models?
A. To provide spatial location information of patches
B. To reduce the size of the input
C. To add color information to patches
D. To increase the number of patches
Which of the following is NOT a characteristic of Vision Transformers?
A. Require large datasets for training
B. Use of self-attention to capture relationships
C. Built-in convolutional layers
D. Divide images into patches
What is a common evaluation metric for ViT on classification tasks?
A. Perplexity
B. Mean squared error
C. BLEU score
D. Accuracy
How does ViT handle the spatial structure of images?
A. By using convolutional filters
B. By using positional embeddings
C. By flattening the image into a vector
D. By ignoring spatial information
Explain how Vision Transformer (ViT) processes an image from input to output.
Think about how ViT treats image patches like words in a sentence.
Describe the advantages and challenges of using Vision Transformers compared to CNNs.
Consider data needs and model design differences.