Practice - 5 Tasks

Answer the questions below

1. Fill in the blank

easy

Complete the code to load the CLIP model and preprocess function.

Computer Vision

import clip
import torch

model, preprocess = clip.load([1])

A. "ViT-B/32"

B. "ResNet50"

C. "BERT"

D. "GPT-2"

2. Fill in the blank

medium

Complete the code to tokenize the input text for CLIP.

Computer Vision

text = clip.tokenize([1])

A. 12345

B. "a photo of a cat"

C. ['a', 'photo', 'of', 'a', 'cat']

D. None

3. Fill in the blank

hard

Complete the code so that the image tensor is moved to the correct device for CLIP inference.

Computer Vision

device = "cuda" if torch.cuda.is_available() else "cpu"
image_input = preprocess(image).unsqueeze(0).[1](device)

A. device

B. cuda

C. to

D. cpu

4. Fill in the blank

hard

Fill both blanks to compute image and text features and normalize them for similarity calculation.

Computer Vision

with torch.no_grad():
    image_features = model.encode_image([1])
    text_features = model.encode_text([2])

image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

A. image_input

B. text_input

C. image

D. text
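
The normalization step in this task divides each feature vector by its own length, so that a plain dot product between two features becomes their cosine similarity. A minimal sketch of that idea follows, using plain Python lists instead of torch tensors so it runs without CLIP installed. The helper names `l2_normalize` and `dot` and the 3-dimensional vectors are assumptions made for illustration; they are not part of the CLIP API.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (list analogue of x / x.norm(dim=-1, keepdim=True))."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy features; real CLIP features have hundreds of dimensions.
image_feature = [3.0, 4.0, 0.0]
text_feature = [6.0, 8.0, 0.0]  # same direction, twice the length

image_unit = l2_normalize(image_feature)
text_unit = l2_normalize(text_feature)

# After normalization the dot product depends only on direction, not length.
cosine = dot(image_unit, text_unit)
print(round(cosine, 6))
```

Because the two toy vectors point in the same direction, their cosine similarity is 1 even though their lengths differ, which is why features are normalized before they are compared.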

5. Fill in the blank

hard

Fill all three blanks to calculate the similarity scores between image and text features and get the top matching text index.

Computer Vision

similarity = (100.0 * image_features @ [1].T).softmax(dim=-1)
top_prob, top_label = similarity[0].[2](1, dim=0)
print(f"Top matching text index: {[3]}")

A. text_features

B. topk

C. argmax

D. top_label
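
The scoring step in this task scales the cosine similarities by 100.0, applies a softmax across the candidate texts, and then picks the highest-probability entry. The sketch below reproduces that arithmetic with plain Python lists so it can be run without torch. The similarity values and the `softmax` helper are hypothetical, chosen only to show how the scale factor sharpens the distribution.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities between one image and three candidate captions.
cosine_scores = [0.21, 0.29, 0.18]

# The factor 100.0 acts like a temperature: small gaps in cosine
# similarity turn into large gaps in probability.
probs = softmax([100.0 * s for s in cosine_scores])

# Index of the best-matching caption and its probability.
top_index = max(range(len(probs)), key=probs.__getitem__)
top_prob = probs[top_index]
print(top_index, round(top_prob, 4))
```

Without the scale factor, the three probabilities here would all be close to one third, so the ranking would be the same but the scores would be much harder to read as confidences.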

Practice

(1/5)

1. What is the main purpose of the CLIP model in computer vision?

easy

A. To connect images and text by learning their relationship

B. To generate images from random noise

C. To classify images into fixed categories without text

D. To detect objects using bounding boxes only

CLIP (vision-language model) in Computer Vision - Interactive Code Practice

Solution

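Step 1: Understand CLIP's design goal

CLIP is trained on image-text pairs so that an image and its matching caption end up close together in a shared embedding space.

Step 2: Compare options with CLIP's purpose

Options B, C, and D describe image generation, fixed-category classification without text, and bounding-box detection. None of them involves linking images to text.

Final Answer: A. To connect images and text by learning their relationship

Quick Check: CLIP relates images and text.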

Solution

Step 1: Recall the transformers library syntax

Step 2: Match options to correct syntax

Final Answer:

Quick Check:

Solution

Step 1: Understand model.get_image_features output

Step 2: Analyze the conversion to numpy array

Final Answer:

Quick Check:

Solution

Step 1: Check how model methods accept inputs

Step 2: Identify the error and fix

Final Answer:

Quick Check:

Solution

Step 1: Understand CLIP feature comparison

Step 2: Evaluate options for matching

Final Answer:

Quick Check: