Computer Vision · ~20 mins

CLIP (vision-language model) in Computer Vision - Practice Problems & Coding Challenges

Challenge - 5 Problems
🎖️
CLIP Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
🧠 Conceptual
intermediate
2:00 remaining
How does CLIP learn to connect images and text?

CLIP is a model that understands images and text together. How does it learn this connection?

A. By training only on images to classify objects without any text input.
B. By clustering images based on color patterns without using text.
C. By generating text captions from images using a language-only model.
D. By training on pairs of images and their matching text descriptions to align their representations.
Attempts: 2 left
💡 Hint

Think about how the model sees both images and text at the same time during training.
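To make the hint concrete, here is a minimal sketch of contrastive image–text alignment (toy sizes and random tensors standing in for real encoder outputs; real CLIP uses larger embeddings). Matching pairs sit on the diagonal of a similarity matrix, and a symmetric cross-entropy loss pulls them together:

```python
import torch
import torch.nn.functional as F

# Toy embeddings standing in for CLIP's image/text encoder outputs
# (hypothetical sizes, not CLIP's real configuration).
batch, dim = 4, 32
torch.manual_seed(0)
image_emb = F.normalize(torch.randn(batch, dim), dim=-1)
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)

# Similarity matrix: entry (i, j) compares image i with caption j.
logits = image_emb @ text_emb.T  # shape (batch, batch)

# Matching pairs sit on the diagonal, so the contrastive target
# for row i is index i (applied symmetrically over rows and columns).
targets = torch.arange(batch)
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.T, targets)) / 2
print(logits.shape, loss.item())
```

Training on many such batches teaches both encoders to map an image and its caption to nearby points in the shared embedding space.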

Predict Output
intermediate
2:00 remaining
What is the output shape of CLIP's image encoder?

Given a batch of 8 images, each resized to 224x224 pixels with 3 color channels, what is the shape of the output from CLIP's image encoder?

import torch
from torchvision.transforms import Compose, Resize, ToTensor, Normalize

# Dummy batch of 8 images
images = torch.randn(8, 3, 224, 224)

# Assume clip_model is loaded and has an image encoder
# output = clip_model.encode_image(images)

# What is output.shape?
A. (8, 3, 224, 224)
B. (8, 512)
C. (8, 1024)
D. (224, 224, 3)
Attempts: 2 left
💡 Hint

The image encoder outputs a vector embedding per image, not an image tensor.
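The hint can be demonstrated with a stand-in encoder: any backbone that ends in a projection to the shared embedding space collapses each image to one vector. The layers below are hypothetical (not CLIP's actual architecture); the 512-dimensional output matches the ViT-B/32 CLIP variant, though other variants use different widths:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for CLIP's image encoder: a small backbone
# followed by a linear projection into a 512-d embedding space.
embed_dim = 512
fake_encoder = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=7, stride=4),  # toy feature extractor
    nn.AdaptiveAvgPool2d(1),                    # global pooling
    nn.Flatten(),                               # (batch, 32)
    nn.Linear(32, embed_dim),                   # project to embedding space
)

images = torch.randn(8, 3, 224, 224)
features = fake_encoder(images)
print(features.shape)  # torch.Size([8, 512])
```

The key point: the spatial dimensions disappear, leaving one embedding vector per image.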

Model Choice
advanced
2:00 remaining
Which architecture is used for CLIP's text encoder?

CLIP uses a specific type of neural network to process text input. Which one is it?

A. A transformer-based model that processes sequences of words.
B. A convolutional neural network (CNN) designed for images.
C. A recurrent neural network (RNN) with LSTM units.
D. A simple feedforward neural network with no sequence handling.
Attempts: 2 left
💡 Hint

Think about models good at understanding sequences and context in language.
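As a rough illustration of the hint, here is a toy Transformer text encoder (toy vocabulary and layer sizes, not CLIP's real configuration). CLIP pools the hidden state at the end-of-text token; taking the last position here is a simplified stand-in for that:

```python
import torch
import torch.nn as nn

# Minimal stand-in for a Transformer text encoder: token embeddings
# fed through self-attention layers, then pooled to one vector.
vocab, seq_len, width = 1000, 16, 64  # toy sizes, not CLIP's config
embed = nn.Embedding(vocab, width)
layer = nn.TransformerEncoderLayer(d_model=width, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

tokens = torch.randint(0, vocab, (8, seq_len))
hidden = encoder(embed(tokens))   # (8, seq_len, width)
text_features = hidden[:, -1, :]  # pool the final position (stand-in for EOT)
print(text_features.shape)  # torch.Size([8, 64])
```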

Metrics
advanced
2:00 remaining
Which metric best evaluates CLIP's zero-shot classification accuracy?

CLIP can classify images without training on specific classes. Which metric measures how well it does this?

A. Top-1 accuracy comparing predicted labels to true labels.
B. BLEU score measuring text generation quality.
C. Mean Squared Error (MSE) between image pixels and text tokens.
D. Perplexity of the language model on text input.
Attempts: 2 left
💡 Hint

Think about how classification models are usually evaluated.
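A sketch of the evaluation the hint points to, using random vectors in place of real CLIP embeddings: each image is assigned the class whose text embedding it is most similar to, and top-1 accuracy is the fraction of correct assignments:

```python
import torch
import torch.nn.functional as F

# Zero-shot classification sketch with random stand-in embeddings.
torch.manual_seed(0)
num_classes, dim = 5, 32
class_text_emb = F.normalize(torch.randn(num_classes, dim), dim=-1)
image_emb = F.normalize(torch.randn(8, dim), dim=-1)
true_labels = torch.randint(0, num_classes, (8,))

sims = image_emb @ class_text_emb.T  # (8, num_classes)
preds = sims.argmax(dim=-1)          # top-1 prediction per image
top1 = (preds == true_labels).float().mean()
print(top1.item())
```

With real CLIP, `class_text_emb` would come from encoding prompts like "a photo of a dog" for each candidate class.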

🔧 Debug
expert
3:00 remaining
Why does CLIP's similarity score between image and text embeddings sometimes produce negative values?

CLIP computes similarity scores between image and text vectors. Sometimes these scores are negative. Why?

A. Because the image encoder outputs random noise vectors.
B. Because the model outputs raw logits that are always negative.
C. Because the embeddings are normalized and cosine similarity can range from -1 to 1.
D. Because the text encoder uses ReLU activations that produce negative values.
Attempts: 2 left
💡 Hint

Recall the range of cosine similarity values between two vectors.
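A two-line demonstration of the hint: after L2-normalization, a dot product is exactly cosine similarity, so it lies in [-1, 1] and opposing vectors produce negative scores (hand-picked toy vectors, not real CLIP embeddings):

```python
import torch
import torch.nn.functional as F

# After L2-normalization, a dot product equals cosine similarity,
# so opposing vectors give a score of -1.
a = F.normalize(torch.tensor([[1.0, 0.0]]), dim=-1)
b = F.normalize(torch.tensor([[-1.0, 0.0]]), dim=-1)
sim = (a @ b.T).item()
print(sim)  # -1.0
```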