
CLIP (vision-language model) in Computer Vision - Interactive Code Practice

Practice - 5 Tasks
Answer the questions below
1. Fill in the blank
easy

Complete the code to load the CLIP model and preprocess function.

import clip
import torch

model, preprocess = clip.load([1])
A. "ViT-B/32"
B. "ResNet50"
C. "BERT"
D. "GPT-2"
Common Mistakes
Using a text-only model name like 'BERT' or 'GPT-2' which are not vision models.
Using "ResNet50", which is not a model name the clip package recognizes (its ResNet-50 variant is named "RN50") and is not the model used in this example.
2. Fill in the blank
medium

Complete the code to tokenize the input text for CLIP.

text = clip.tokenize([1])
A. 12345
B. "a photo of a cat"
C. ['a', 'photo', 'of', 'a', 'cat']
D. None
Common Mistakes
Passing a list of individual words, which clip.tokenize treats as five separate texts rather than one sentence.
Passing a number or None, which raises an error.
3. Fill in the blank
hard

Fix the error in the code to move the image tensor to the correct device for CLIP inference.

device = "cuda" if torch.cuda.is_available() else "cpu"
image_input = preprocess(image).unsqueeze(0).[1](device)
A. device
B. cuda
C. to
D. cpu
Common Mistakes
Using .cuda() directly without checking if CUDA is available.
Using .device, which is an attribute of a tensor, not a method that moves it.
4. Fill in the blank
hard

Fill both blanks to compute image and text features and normalize them for similarity calculation.

with torch.no_grad():
    image_features = model.encode_image([1])
    text_features = model.encode_text([2])

image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
A. image_input
B. text_input
C. image
D. text
Common Mistakes
Passing raw image or text variables instead of processed tensors.
Confusing variable names for inputs.
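
The normalization step in this task can be illustrated without any deep-learning dependency. The sketch below is a plain-Python stand-in, not CLIP's own code: `l2_normalize` and `dot` are hypothetical helper names, and the vectors are made-up toy values rather than real embeddings. It shows that dividing a vector by its L2 norm gives it unit length, so the dot product of two normalized vectors equals their cosine similarity.

```python
import math

def l2_normalize(vec):
    """Divide a vector by its L2 norm (mirrors x / x.norm(dim=-1, keepdim=True))."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy stand-ins for an image embedding and a text embedding (hypothetical values).
image_vec = [3.0, 4.0]
text_vec = [6.0, 8.0]

image_unit = l2_normalize(image_vec)
text_unit = l2_normalize(text_vec)

# Both vectors point in the same direction but differ in magnitude,
# so their cosine similarity after normalization is 1 (up to rounding).
cosine = dot(image_unit, text_unit)
print(cosine)
```

Without the normalization, the dot product would also grow with the magnitude of each vector, which is why the task normalizes both feature tensors before comparing them.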
5. Fill in the blank
hard

Fill all three blanks to calculate the similarity scores between image and text features and get the top matching text index.

similarity = (100.0 * image_features @ [1].T).softmax(dim=-1)
top_label = similarity[0].[2](dim=0)
print(f"Top matching text index: {[3]}")
A. text_features
B. topk
C. argmax
D. top_label
Common Mistakes
Using max(dim=0) instead of argmax; max returns a (values, indices) pair rather than a single index.
Using topk, which requires k and returns the top k values and indices rather than a single index.
Multiplying by the transpose of image_features instead of the transpose of text_features.
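
The scoring step in this task (scale, softmax, pick the largest entry) can be traced by hand. The following is a minimal plain-Python sketch under stated assumptions: `softmax` is a hypothetical helper written here for illustration, and `cosine_scores` holds invented similarity values, not real model output.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities between one image and three captions.
cosine_scores = [0.31, 0.24, 0.19]

# Scaling by 100 (as in the task) sharpens the distribution before softmax.
probs = softmax([100.0 * s for s in cosine_scores])

# argmax: the index of the largest probability, i.e. the best-matching caption.
top_label = max(range(len(probs)), key=probs.__getitem__)
top_prob = probs[top_label]
print(f"Top matching text index: {top_label}")
```

Small gaps in cosine similarity become large gaps in probability once the scores are multiplied by 100, which is why the first caption ends up with nearly all of the probability mass here.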

Practice

1. What is the main purpose of the CLIP model in computer vision?
easy
A. To connect images and text by learning their relationship
B. To generate images from random noise
C. To classify images into fixed categories without text
D. To detect objects using bounding boxes only

Solution

  1. Step 1: Understand CLIP's design goal

    CLIP is designed to learn how images and text relate to each other, enabling it to match images with descriptions.
  2. Step 2: Compare options with CLIP's purpose

    Options B, C, and D describe other tasks (image generation, classification without text, and object detection), none of which is CLIP's main function.
  3. Final Answer:

    To connect images and text by learning their relationship -> Option A
  4. Quick Check:

    CLIP links images and text = A [OK]
Hint: CLIP matches images with text descriptions [OK]
Common Mistakes:
  • Confusing CLIP with image generation models
  • Thinking CLIP only classifies images
  • Assuming CLIP detects objects with bounding boxes
2. Which of the following is the correct way to load a pre-trained CLIP model using the Hugging Face transformers library?
easy
A. model = transformers.CLIP('openai/clip-vit-base-patch32')
B. model = CLIP.from_pretrained('clip-base')
C. model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
D. model = load_clip('vit-base')

Solution

  1. Step 1: Recall the transformers library syntax

    A pre-trained model is loaded by calling from_pretrained on the model class and passing the model identifier string.
  2. Step 2: Match options to correct syntax

    Option C calls CLIPModel.from_pretrained with the full model identifier 'openai/clip-vit-base-patch32'. The other options use class names or functions that do not exist in the library.
  3. Final Answer:

    model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32') -> Option C
  4. Quick Check:

    Use CLIPModel.from_pretrained() with model name [OK]
Hint: Use CLIPModel.from_pretrained('model-name') to load CLIP [OK]
Common Mistakes:
  • Using wrong class names like CLIP or transformers.CLIP
  • Missing from_pretrained method
  • Using incomplete or incorrect model identifiers
3. Given the following Python code snippet using CLIP, what will be the output type of image_features?
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

image = Image.new('RGB', (224, 224), color='red')
inputs = processor(images=image, return_tensors='pt')

outputs = model.get_image_features(**inputs)
image_features = outputs.detach().numpy()
medium
A. A numpy array representing the image embedding vector
B. A PIL Image object
C. A PyTorch tensor with gradients enabled
D. A string describing the image

Solution

  1. Step 1: Understand model.get_image_features output

    This method returns a PyTorch tensor representing the image embedding vector.
  2. Step 2: Analyze the conversion to numpy array

    Calling detach().numpy() converts the tensor to a numpy array without gradients, so the final type is a numpy array.
  3. Final Answer:

    A numpy array representing the image embedding vector -> Option A
  4. Quick Check:

    Image features output = numpy array [OK]
Hint: detach().numpy() converts tensor to numpy array [OK]
Common Mistakes:
  • Thinking output is still a tensor with gradients
  • Confusing image features with image object
  • Expecting a text description instead of embeddings
4. Identify the error in this CLIP usage code snippet and select the fix:
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

inputs = processor(text='a photo of a cat', return_tensors='pt')
outputs = model.get_text_features(inputs)
medium
A. Change processor(text='a photo of a cat', return_tensors='pt') to processor(text=['a photo of a cat'], return_tensors='pt')
B. Change model.get_text_features(inputs) to model.get_text_features(**inputs)
C. Remove return_tensors='pt' from the processor call
D. Replace CLIPModel with CLIPTextModel

Solution

  1. Step 1: Check how model methods accept inputs

    CLIP model methods expect keyword arguments unpacked from the processor output, so **inputs is needed.
  2. Step 2: Identify the error and fix

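    Passing inputs positionally binds the whole processor output to the first parameter (input_ids) instead of a tensor, which raises an error; changing the call to model.get_text_features(**inputs) fixes it.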
  3. Final Answer:

    Change model.get_text_features(inputs) to model.get_text_features(**inputs) -> Option B
  4. Quick Check:

    Use **inputs to unpack processor output [OK]
Hint: Unpack processor output with ** when calling model methods [OK]
Common Mistakes:
  • Passing processor output without unpacking
  • Not using return_tensors='pt' in processor
  • Confusing CLIPModel with CLIPTextModel
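
The difference between passing the processor output positionally and unpacking it with ** is ordinary Python behavior, so it can be shown without loading a model. In this sketch, `get_text_features` is a hypothetical stand-in function (not the transformers method), and the token ids are invented; it only demonstrates how ** maps dictionary keys to keyword parameters.

```python
def get_text_features(input_ids=None, attention_mask=None):
    """Hypothetical stand-in for a model method that takes keyword arguments."""
    if not isinstance(input_ids, list):
        raise TypeError("input_ids must be a list of token ids")
    return {"n_tokens": len(input_ids), "has_mask": attention_mask is not None}

# Stand-in for the dict-like object a processor returns (token ids are made up).
inputs = {"input_ids": [101, 202, 303], "attention_mask": [1, 1, 1]}

# Unpacking with ** maps each key to the parameter of the same name.
features = get_text_features(**inputs)

# Passing the dict positionally binds the whole dict to input_ids instead.
try:
    get_text_features(inputs)
    positional_ok = True
except TypeError:
    positional_ok = False

print(features, positional_ok)
```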
5. You want to find the most relevant image from a list using CLIP given a text query. Which approach correctly combines image and text features to find the best match?
hard
A. Match images by comparing their file names with the text query
B. Compare raw pixel values of images with text token IDs directly
C. Use Euclidean distance between unnormalized image and text features without preprocessing
D. Compute cosine similarity between normalized image and text feature vectors, then select the highest score

Solution

  1. Step 1: Understand CLIP feature comparison

    CLIP produces feature vectors for images and text; similarity is measured by cosine similarity after normalization.
  2. Step 2: Evaluate options for matching

    Option D correctly applies cosine similarity to normalized feature vectors and picks the highest score. Options A, B, and C rely on file names, raw pixels and token IDs, or unnormalized Euclidean distance, none of which compares the learned features in a meaningful way.
  3. Final Answer:

    Compute cosine similarity between normalized image and text feature vectors, then select the highest score -> Option D
  4. Quick Check:

    Use cosine similarity on normalized features [OK]
Hint: Normalize features and use cosine similarity to match [OK]
Common Mistakes:
  • Comparing raw pixels with text tokens
  • Using Euclidean distance without normalization
  • Matching based on file names instead of features
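
The retrieval approach from this question can be sketched in plain Python. This is an illustration under stated assumptions, not a CLIP pipeline: the file names and embedding values are invented, and `cosine_similarity` is a helper written here for the example. In practice the vectors would come from the model's image and text encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot product of the two vectors after L2 normalization."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings; real CLIP features would come from the model.
text_query = [1.0, 0.0, 1.0]
image_embeddings = {
    "cat.jpg": [0.9, 0.1, 0.8],
    "car.jpg": [0.0, 1.0, 0.1],
    "dog.jpg": [5.0, 4.0, 1.0],  # large magnitude, different direction
}

# Rank images by cosine similarity to the text query.
scores = {name: cosine_similarity(vec, text_query)
          for name, vec in image_embeddings.items()}
best_match = max(scores, key=scores.get)

# For contrast: an unnormalized dot product is swayed by vector magnitude.
raw_dot = {name: sum(x * y for x, y in zip(vec, text_query))
           for name, vec in image_embeddings.items()}
raw_best = max(raw_dot, key=raw_dot.get)

print(best_match, raw_best)
```

With these toy values the cosine ranking picks cat.jpg, while the unnormalized dot product picks dog.jpg only because that vector is longer, which illustrates why normalization comes first.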