Practice

1. What is the main purpose of the CLIP model in computer vision?

easy

A. To connect images and text by learning their relationship

B. To generate images from random noise

C. To classify images into fixed categories without text

D. To detect objects using bounding boxes only

Solution

Step 1: Understand CLIP's design goal
CLIP is designed to learn how images and text relate to each other, enabling it to match images with descriptions.
Step 2: Compare options with CLIP's purpose
Options B, C, and D describe other tasks (image generation, classification without text, and object detection), which are not CLIP's main function.
Final Answer:
To connect images and text by learning their relationship -> Option A
Quick Check:
CLIP links images and text = A [OK]

Hint: CLIP matches images with text descriptions [OK]

Common Mistakes:

Confusing CLIP with image generation models
Thinking CLIP only classifies images
Assuming CLIP detects objects with bounding boxes

2. Which of the following is the correct way to load a pre-trained CLIP model using Python's transformers library?

easy

A. model = transformers.CLIP('openai/clip-vit-base-patch32')

B. model = CLIP.from_pretrained('clip-base')

C. model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')

D. model = load_clip('vit-base')

Solution

Step 1: Recall the transformers library syntax
In transformers, a pre-trained model is loaded by calling the from_pretrained class method on the model class and passing the model identifier string.
Step 2: Match options to correct syntax
Option C calls CLIPModel.from_pretrained with the full model identifier 'openai/clip-vit-base-patch32'. The other options rely on class or function names that transformers does not provide (transformers.CLIP, CLIP, load_clip).
Final Answer:
model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32') -> Option C
Quick Check:
Use CLIPModel.from_pretrained() with model name [OK]

Hint: Use CLIPModel.from_pretrained('model-name') to load CLIP [OK]

Common Mistakes:

Using wrong class names like CLIP or transformers.CLIP
Missing from_pretrained method
Using incomplete or incorrect model identifiers

3. Given the following Python code snippet using CLIP, what will be the output type of image_features?

from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

image = Image.new('RGB', (224, 224), color='red')
inputs = processor(images=image, return_tensors='pt')

outputs = model.get_image_features(**inputs)
image_features = outputs.detach().numpy()

medium

A. A numpy array representing the image embedding vector

B. A PIL Image object

C. A PyTorch tensor with gradients enabled

D. A string describing the image

Solution

Step 1: Understand model.get_image_features output
This method returns a PyTorch tensor representing the image embedding vector.
Step 2: Analyze the conversion to numpy array
detach() returns a tensor that is cut off from the autograd graph, and numpy() then converts that tensor to a numpy array, so the final type is a numpy array.
Final Answer:
A numpy array representing the image embedding vector -> Option A
Quick Check:
Image features output = numpy array [OK]

Hint: detach().numpy() converts tensor to numpy array [OK]

Common Mistakes:

Thinking output is still a tensor with gradients
Confusing image features with image object
Expecting a text description instead of embeddings
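
The conversion step can be tried without downloading CLIP. The sketch below is only an illustration: it assumes PyTorch and NumPy are installed, and it uses a hand-built tensor as a stand-in for the output of model.get_image_features. The (1, 512) shape and the names weights and direct_call_failed are assumptions made for this example.

```python
import numpy as np
import torch

# Stand-in for the model output: a (1, 512) float tensor that is attached to
# the autograd graph. The shape is an assumption chosen for illustration.
weights = torch.ones(512, requires_grad=True)
outputs = torch.randn(1, 512) * weights

# Calling numpy() directly on a tensor that requires grad raises RuntimeError.
try:
    outputs.numpy()
    direct_call_failed = False
except RuntimeError:
    direct_call_failed = True

# detach() drops the autograd link, after which numpy() succeeds.
image_features = outputs.detach().numpy()

print(type(image_features), image_features.shape)
```

This is why the snippet in the question calls detach() before numpy(): the model's parameters require gradients, so the tensor it returns does too.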

4. Identify the error in this CLIP usage code snippet and select the fix:

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

inputs = processor(text='a photo of a cat')
outputs = model.get_text_features(inputs)

medium

A. Change processor(text='a photo of a cat') to processor(text=['a photo of a cat'])

B. Change model.get_text_features(inputs) to model.get_text_features(**inputs)

C. Add return_tensors='pt' inside the processor call

D. Replace CLIPModel with CLIPTextModel

Solution

Step 1: Check how model methods accept inputs
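get_text_features takes keyword arguments such as input_ids and attention_mask. The processor returns a dictionary-like object holding those fields, so it has to be unpacked with **inputs.
Step 2: Identify the error and fix
Passing inputs positionally hands the whole object to the first parameter, which causes an error; changing the call to model.get_text_features(**inputs) fixes it. The snippet also omits return_tensors='pt', so the processor call needs that argument as well before the model receives PyTorch tensors.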
Final Answer:
Change model.get_text_features(inputs) to model.get_text_features(**inputs) -> Option B
Quick Check:
Use **inputs to unpack processor output [OK]

Hint: Unpack processor output with ** when calling model methods [OK]

Common Mistakes:

Passing processor output without unpacking
Not using return_tensors='pt' in processor
Confusing CLIPModel with CLIPTextModel
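
The difference between passing a mapping positionally and unpacking it with ** is ordinary Python behavior, so it can be shown without CLIP. In the sketch below, get_text_features is a hypothetical stand-in function (not the transformers method), and the token IDs are made-up values used only for illustration.

```python
def get_text_features(input_ids=None, attention_mask=None):
    """Hypothetical stand-in that just reports what each parameter received."""
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical processor output: a mapping from argument names to values.
inputs = {"input_ids": [[101, 2023, 102]], "attention_mask": [[1, 1, 1]]}

# Positional call: the whole dict is bound to the first parameter, input_ids.
positional = get_text_features(inputs)

# Unpacked call: each key is passed as its own keyword argument.
unpacked = get_text_features(**inputs)

print(positional["attention_mask"])  # None, the mask never arrived
print(unpacked["attention_mask"])    # [[1, 1, 1]]
```

With the positional call the attention mask is silently dropped and input_ids holds a dict instead of token IDs, which is the kind of mismatch that makes the real model call fail.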

5. You want to find the most relevant image from a list using CLIP given a text query. Which approach correctly combines image and text features to find the best match?

hard

A. Match images by comparing their file names with the text query

B. Compare raw pixel values of images with text token IDs directly

C. Use Euclidean distance between unnormalized image and text features without preprocessing

D. Compute cosine similarity between normalized image and text feature vectors, then select the highest score

Solution

Step 1: Understand CLIP feature comparison
CLIP produces feature vectors for images and text; similarity is measured by cosine similarity after normalization.
Step 2: Evaluate options for matching
Option D correctly uses cosine similarity on normalized vectors and picks the highest score. Options A, B, and C use invalid or irrelevant methods.
Final Answer:
Compute cosine similarity between normalized image and text feature vectors, then select the highest score -> Option D
Quick Check:
Use cosine similarity on normalized features [OK]

Hint: Normalize features and use cosine similarity to match [OK]

Common Mistakes:

Comparing raw pixels with text tokens
Using Euclidean distance without normalization
Matching based on file names instead of features
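
The matching step from Option D can be sketched in plain Python. This is a minimal illustration, not CLIP's own implementation: the toy 3-dimensional vectors stand in for real embeddings, the helper names l2_normalize, cosine_similarity, and best_match are hypothetical, and the vectors are assumed to be non-zero.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Dot product of the two vectors after L2 normalization."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def best_match(text_feature, image_features):
    """Return the index of the highest-scoring image and all scores."""
    scores = [cosine_similarity(text_feature, f) for f in image_features]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Toy vectors standing in for real CLIP embeddings.
text_feature = [1.0, 0.0, 1.0]
image_features = [
    [0.0, 5.0, 0.0],     # orthogonal to the query
    [2.0, 0.1, 2.1],     # nearly parallel to the query
    [-1.0, 0.0, -1.0],   # opposite direction
]

best_index, scores = best_match(text_feature, image_features)
print(best_index, scores)
```

Because the vectors are normalized first, the first image's larger magnitude does not help it: only the direction of each vector affects the score.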

Epoch	Loss ↓	Accuracy ↑	Observation
1	2.3	0.12	High loss and low accuracy as model starts learning
5	1.1	0.45	Loss decreasing and accuracy improving steadily
10	0.6	0.70	Model learning meaningful image-text relations
15	0.4	0.82	Good convergence with high accuracy
20	0.3	0.88	Loss low and accuracy high, model well trained

CLIP (vision-language model) in Computer Vision - Model Pipeline Trace
