Computer Vision · ~10 mins

CLIP (vision-language model) in Computer Vision - Interactive Code Practice

Practice - 5 Tasks
Answer the questions below
Task 1: Fill in the blank (easy)

Complete the code to load the CLIP model and preprocess function.

import clip
import torch

model, preprocess = clip.load([1])
Drag options to the blanks, or click a blank and then an option.
A) "ViT-B/32"
B) "ResNet50"
C) "BERT"
D) "GPT-2"
Common Mistakes
Using a text-only model name like 'BERT' or 'GPT-2'; these are not vision models.
Using a ResNet model name, which is not the default checkpoint in this example.
Task 2: Fill in the blank (medium)

Complete the code to tokenize the input text for CLIP.

text = clip.tokenize([1])
A) 12345
B) "a photo of a cat"
C) ['a', 'photo', 'of', 'a', 'cat']
D) None
Common Mistakes
Passing a list of words instead of a full sentence string.
Passing a number or None, which raises an error.
Task 3: Fill in the blank (hard)

Fix the error in the code to move the image tensor to the correct device for CLIP inference.

device = "cuda" if torch.cuda.is_available() else "cpu"
image_input = preprocess(image).unsqueeze(0).[1](device)
A) device
B) cuda
C) to
D) cpu
Common Mistakes
Calling .cuda() directly without checking whether CUDA is available.
Using .device, which is an attribute, not a method.
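The device-fallback pattern in Task 3 can be sketched without torch; here a stand-in boolean replaces torch.cuda.is_available() (an assumption, since no GPU stack is loaded in this sketch):

```python
# Minimal sketch of the "cuda if available, else cpu" fallback.
# `cuda_available` stands in for torch.cuda.is_available().
def pick_device(cuda_available: bool) -> str:
    # Prefer the GPU when present; otherwise fall back to CPU.
    return "cuda" if cuda_available else "cpu"

print(pick_device(True))   # cuda
print(pick_device(False))  # cpu
```

The resulting string is what gets passed to .to(device), which works on both CPU-only and GPU machines, unlike an unconditional .cuda() call.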
Task 4: Fill in the blank (hard)

Fill both blanks to compute image and text features and normalize them for similarity calculation.

with torch.no_grad():
    image_features = model.encode_image([1])
    text_features = model.encode_text([2])

image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
A) image_input
B) text_input
C) image
D) text
Common Mistakes
Passing the raw image or text variables instead of the preprocessed tensors.
Confusing the input variable names.
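The normalization step in Task 4 divides each feature vector by its L2 norm, which is what .norm(dim=-1, keepdim=True) computes per row. A torch-free sketch of that per-vector math:

```python
import math

# Sketch of L2 normalization for a single feature vector:
# divide every component by the vector's Euclidean length.
def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

v = l2_normalize([3.0, 4.0])
print(v)  # [0.6, 0.8]
```

After normalization every vector has unit length, so the dot products computed in the next task are cosine similarities.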
Task 5: Fill in the blank (hard)

Fill all three blanks to calculate the similarity scores between image and text features and get the top matching text index.

similarity = (100.0 * image_features @ [1].T).softmax(dim=-1)
top_label = similarity[0].[2](dim=0)
print(f"Top matching text index: [3]")
A) text_features
B) topk
C) argmax
D) top_label
Common Mistakes
Using max instead of argmax; max returns values, not indices.
Using topk, which returns multiple top values rather than a single index.
Multiplying with image_features instead of text_features transpose.
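The similarity step in Task 5 scales the image-text dot products by 100, softmaxes them, and takes the argmax index. A plain-Python sketch of that chain (the helper names here are illustrative, not part of the CLIP API):

```python
import math

# Numerically stable softmax over a list of logits.
def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Sketch of: (100.0 * image_features @ text_features.T).softmax(...)
# followed by argmax, for one image against several text vectors.
def top_match(image_feat, text_feats):
    logits = [100.0 * sum(a * b for a, b in zip(image_feat, t))
              for t in text_feats]
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return probs[idx], idx

prob, label = top_match([0.6, 0.8], [[0.8, 0.6], [0.6, 0.8]])
print(label)  # 1 (the second text matches best)
```

Because the feature vectors are unit-normalized, the dot products are cosine similarities, and the 100.0 factor sharpens the softmax so the best match dominates the probability mass.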