Practice - 5 Tasks
Answer the questions below
Question 1: Fill in the blank (easy, Computer Vision)
Complete the code to load the CLIP model and preprocess function.

import clip
import torch

model, preprocess = clip.load([1])
Common mistakes:
- Using a text-only model name like "BERT" or "GPT-2", which are not vision models.
- Using a ResNet model name, which is not the default for CLIP in this example.
Explanation: The CLIP model is loaded with the model name "ViT-B/32", a common vision transformer variant used in CLIP.
Question 2: Fill in the blank (medium, Computer Vision)
Complete the code to tokenize the input text for CLIP.

text = clip.tokenize([1])
Common mistakes:
- Passing a list of separate words instead of a full sentence string.
- Passing a number or None, which causes errors.
Explanation: The clip.tokenize function expects a string, or a list of strings, describing the text prompt.
Question 3: Fill in the blank (hard, Computer Vision)
Fix the error in the code to move the image tensor to the correct device for CLIP inference.

device = "cuda" if torch.cuda.is_available() else "cpu"
image_input = preprocess(image).unsqueeze(0).[1](device)
Common mistakes:
- Calling .cuda() directly without checking whether CUDA is available.
- Using .device, which is an attribute, not a method.
Explanation: The .to(device) method moves the tensor to the specified device (CPU or GPU).
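The device-selection pattern from this question can be tried end to end with a plain tensor (a minimal sketch assuming only PyTorch is installed; the zero tensor is a made-up stand-in for a preprocessed CLIP image batch):

```python
import torch

# Pick the GPU if one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# A dummy batch shaped like one preprocessed 224x224 RGB image.
image_input = torch.zeros(1, 3, 224, 224).to(device)

# .to(device) returns a tensor on the requested device, so this works
# unchanged on machines with or without a GPU.
print(image_input.device.type)
```

Because .to(device) is a no-op when the tensor is already on the target device, the same line is safe in both the CUDA and CPU branches.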
Question 4: Fill in the blank (hard, Computer Vision)
Fill both blanks to compute image and text features and normalize them for similarity calculation.

with torch.no_grad():
    image_features = model.encode_image([1])
    text_features = model.encode_text([2])
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
Common mistakes:
- Passing the raw image or text variables instead of the processed tensors.
- Confusing the variable names of the inputs.
Explanation: The model encodes the preprocessed image tensor and the tokenized text tensor, which are named image_input and text_input respectively.
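The normalization step divides each feature vector by its L2 norm so that later dot products become cosine similarities. The effect can be checked in plain Python (a stdlib-only sketch; the 4-dimensional vector is a made-up stand-in for one row of image_features):

```python
import math

# A made-up feature vector standing in for one row of image_features.
features = [3.0, 4.0, 0.0, 0.0]

# L2 norm: square root of the sum of squares (here sqrt(9 + 16) = 5).
norm = math.sqrt(sum(x * x for x in features))

# Dividing every component by the norm yields a unit-length vector.
unit = [x / norm for x in features]

# The result has length 1, so its dot product with another unit vector
# is a cosine similarity in [-1, 1].
print(unit)  # → [0.6, 0.8, 0.0, 0.0]
```

This is exactly what features.norm(dim=-1, keepdim=True) followed by the division does in the PyTorch version, one row at a time.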
Question 5: Fill in the blank (hard, Computer Vision)
Fill all three blanks to calculate the similarity scores between the image and text features and get the top matching text index.

similarity = (100.0 * image_features @ [1].T).softmax(dim=-1)
top_prob, top_label = similarity[0].[2](dim=0)
print(f"Top matching text index: [3]")
Common mistakes:
- Using argmax, which returns only an index; unpacking its result into two variables (top_prob, top_label) fails.
- Using topk, which requires a k argument and with k > 1 returns multiple top values rather than a single index.
- Multiplying with image_features instead of the transpose of text_features.
Explanation: Similarity is computed by matrix-multiplying the image features with the transpose of the text features, scaling, and applying a softmax. The top matching text index is found with .max(dim=0) on the similarity scores, which returns both the top probability and its index.
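The scoring step can be reproduced in plain Python to see why the scaled dot products become probabilities and how the top index is picked (a stdlib-only sketch; the three scores are made-up stand-ins for one row of 100.0 * image_features @ text_features.T):

```python
import math

# Made-up scaled similarity scores for one image against three text prompts.
scores = [18.0, 25.0, 12.0]

# Softmax: exponentiate (shifted by the max for numerical stability)
# and normalize so the outputs sum to 1.
m = max(scores)
exps = [math.exp(s - m) for s in scores]
total = sum(exps)
probs = [e / total for e in exps]

# The top matching text index is the position of the largest probability,
# mirroring similarity[0].max(dim=0) in the PyTorch version.
top_prob = max(probs)
top_label = probs.index(top_prob)
print(f"Top matching text index: {top_label}")  # → 1
```

Subtracting the maximum score before exponentiating does not change the softmax result but keeps math.exp from overflowing on large scores, which matters once scores are scaled by 100.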