CLIP is a model that understands images and text together. How does it learn this connection?
Think about how the model sees both images and text at the same time during training.
CLIP learns from a large set of image-text pairs: a contrastive objective trains its image encoder and text encoder to produce similar embeddings for matching pairs and dissimilar embeddings for mismatched ones. This is how it links images to their descriptions.
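The contrastive objective above can be sketched for one batch. This is a minimal illustration, not CLIP's actual training code: the random tensors stand in for encoder outputs, and the fixed temperature replaces CLIP's learned temperature parameter.

```python
import torch
import torch.nn.functional as F

# Stand-ins for the encoder outputs: in real CLIP these come from the
# image and text encoders; here they are random, L2-normalized vectors.
batch_size, dim = 8, 512
image_emb = F.normalize(torch.randn(batch_size, dim), dim=-1)
text_emb = F.normalize(torch.randn(batch_size, dim), dim=-1)

# Pairwise similarity matrix: entry (i, j) compares image i with text j.
temperature = 0.07  # learned in real CLIP; fixed here for illustration
logits = image_emb @ text_emb.t() / temperature

# Matching pairs sit on the diagonal, so the target for row i is i.
# The symmetric cross-entropy pulls matching pairs together and pushes
# mismatched pairs apart.
targets = torch.arange(batch_size)
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.t(), targets)) / 2
```

With trained encoders, minimizing this loss makes each image most similar to its own caption within the batch.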
Given a batch of 8 images, each resized to 224x224 pixels with 3 color channels, what is the shape of the output from CLIP's image encoder?
import torch

# Dummy batch of 8 RGB images at 224x224
images = torch.randn(8, 3, 224, 224)

# Assume clip_model is loaded and has an image encoder
# output = clip_model.encode_image(images)
# What is output.shape?
The image encoder outputs a vector embedding per image, not an image tensor.
CLIP's image encoder outputs a fixed-length embedding per image, not an image-shaped tensor. For the ViT-B/32 variant the embedding has 512 dimensions (other variants differ), so for 8 images the output shape is (8, 512).
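The shape can be verified without loading CLIP itself. Below, a hypothetical stand-in encoder (a flatten plus linear layer, nothing like CLIP's actual ResNet or ViT backbone) reproduces the same input-to-output shape mapping:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for CLIP's image encoder: any module mapping
# (batch, 3, 224, 224) -> (batch, 512). Real CLIP uses a ResNet or ViT
# backbone; a flatten + linear layer is enough to show the shapes.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 512))

images = torch.randn(8, 3, 224, 224)  # dummy batch of 8 RGB images
output = encoder(images)
print(output.shape)  # torch.Size([8, 512])
```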
CLIP uses a specific type of neural network to process text input. Which one is it?
Think about models good at understanding sequences and context in language.
CLIP's text encoder is a transformer model, which is excellent at handling sequences of words and capturing context.
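A transformer text encoder in this spirit can be sketched with PyTorch's built-in layers. The hyperparameters below are illustrative only, not CLIP's actual configuration, and the pooling is simplified: real CLIP takes the hidden state at the end-of-text token, while this sketch just takes the last position.

```python
import torch
import torch.nn as nn

# Illustrative transformer text encoder (not CLIP's real config).
vocab_size, max_len, dim = 49408, 77, 512
embed = nn.Embedding(vocab_size, dim)
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

# 4 dummy token sequences of length 77
tokens = torch.randint(0, vocab_size, (4, max_len))
hidden = encoder(embed(tokens))  # (4, 77, 512): one state per token
text_emb = hidden[:, -1, :]      # pool one position -> (4, 512)
print(text_emb.shape)  # torch.Size([4, 512])
```

The self-attention layers let each token's representation depend on every other token, which is what "capturing context" means in practice.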
CLIP can classify images without training on specific classes. Which metric measures how well it does this?
Think about how classification models are usually evaluated.
Top-1 accuracy measures how often the model's top predicted class matches the true class, which is suitable for zero-shot classification evaluation.
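Computing top-1 accuracy from zero-shot similarity scores is straightforward: for each image, the predicted class is the text prompt with the highest score. The score values below are made up for illustration.

```python
import torch

# Made-up similarity scores: 4 images x 3 candidate class prompts.
logits = torch.tensor([[0.9, 0.1, 0.0],
                       [0.2, 0.7, 0.1],
                       [0.3, 0.3, 0.4],
                       [0.6, 0.2, 0.2]])
labels = torch.tensor([0, 1, 2, 1])  # ground-truth classes

# Top-1 prediction: the highest-scoring prompt per image.
preds = logits.argmax(dim=-1)
top1 = (preds == labels).float().mean().item()
print(top1)  # 0.75 (3 of 4 predictions match)
```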
CLIP computes similarity scores between image and text vectors. Sometimes these scores are negative. Why?
Recall the range of cosine similarity values between two vectors.
CLIP uses cosine similarity between normalized embeddings, which ranges from -1 (opposite) to 1 (identical). Negative values mean the vectors point in opposite directions.
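The two extremes of the range are easy to demonstrate with toy vectors:

```python
import torch
import torch.nn.functional as F

# Cosine similarity of two vectors lies in [-1, 1].
a = torch.tensor([1.0, 0.0])
b = torch.tensor([-1.0, 0.0])  # points the opposite way
c = torch.tensor([1.0, 0.0])   # identical direction

print(F.cosine_similarity(a, b, dim=0).item())  # -1.0 (opposite)
print(F.cosine_similarity(a, c, dim=0).item())  # 1.0 (identical)
```

In CLIP's embedding space, a negative score simply means the image and text vectors are pointing away from each other, i.e. the pair is a poor match.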