Computer Vision · ~5 mins

CLIP (vision-language model) in Computer Vision

Introduction

CLIP (Contrastive Language-Image Pre-training) helps computers understand pictures and words together. It learns to match images with their descriptions, so it can find, label, or describe images using plain language.

You want to find images by typing a description instead of keywords.
You want to label pictures automatically without training on specific categories.
You want to build apps that understand both pictures and text together.
You want to search for images that match a sentence or phrase.
You want to create captions or summaries for images.
Syntax
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# Load the pre-trained model and its matching processor
model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

image = Image.open('path_to_image.jpg')
texts = ['a photo of a cat', 'a photo of a dog']

# Tokenize the texts and preprocess the image in one call
inputs = processor(text=texts, images=image, return_tensors='pt', padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # similarity score per image-text pair
probs = logits_per_image.softmax(dim=1)      # probabilities over the texts

Use the CLIPProcessor to prepare both images and text for the model.

The model outputs a similarity score for each image-text pair; higher scores mean a closer match.
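Under the hood, those similarity scores come from comparing L2-normalized image and text embeddings with a scaled dot product. A minimal sketch with random stand-in embeddings (the real model produces the embeddings and learns the scale):

```python
import torch

torch.manual_seed(0)
image_emb = torch.randn(1, 512)  # stand-in for one image embedding
text_emb = torch.randn(2, 512)   # stand-ins for two text embeddings

# L2-normalize so the dot product is a cosine similarity
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

logit_scale = 100.0  # CLIP learns this temperature; ~100 in released models
logits_per_image = logit_scale * image_emb @ text_emb.T  # shape (1, 2)
probs = logits_per_image.softmax(dim=1)  # probabilities over the two texts
print(probs.shape)  # torch.Size([1, 2])
```

This mirrors how `logits_per_image` is produced, which is why applying softmax over the text dimension gives per-description probabilities.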

Examples
This example compares one image to two text descriptions to see which fits better.
texts = ['a red apple', 'a green apple']
inputs = processor(text=texts, images=image, return_tensors='pt', padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)
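Once you have probs, picking the best-matching description is a simple argmax. A small sketch with made-up probabilities standing in for real model output:

```python
import torch

texts = ['a red apple', 'a green apple']
probs = torch.tensor([[0.85, 0.15]])  # made-up probabilities for illustration

# argmax over the text dimension gives the index of the best match
best = texts[probs.argmax(dim=1).item()]
print(best)  # a red apple
```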
This example checks how well one image matches one text description.
image = Image.open('dog.jpg')
text = ['a photo of a dog']
inputs = processor(text=text, images=image, return_tensors='pt')
outputs = model(**inputs)
score = outputs.logits_per_image.item()  # a single raw similarity logit, not a probability
Sample Model

This program creates a simple red square image and compares it to two text descriptions using CLIP. It prints how likely the image matches each description.

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# Load model and processor
model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

# Load an example image
image = Image.new('RGB', (224, 224), color='red')  # simple red square image

# Define text descriptions
texts = ['a red square', 'a blue circle']

# Prepare inputs
inputs = processor(text=texts, images=image, return_tensors='pt', padding=True)

# Get model outputs
outputs = model(**inputs)

# Calculate probabilities
probs = outputs.logits_per_image.softmax(dim=1)

# Print probabilities for each text
for text, prob in zip(texts, probs[0]):
    print(f"Probability that image matches '{text}': {prob.item():.4f}")
Important Notes

CLIP works zero-shot, so it performs well without any training on your own data.

It can compare any image with any text, making it very flexible.

Make sure images are in RGB format; the CLIPProcessor handles resizing (to 224x224 for this model) and normalization for you.
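If an image is grayscale or has an alpha channel, PIL's convert method fixes the mode before preprocessing. A quick sketch using a synthetic grayscale image:

```python
from PIL import Image

# A grayscale image for illustration; a real one would come from Image.open(...)
image = Image.new('L', (64, 64), color=128)
image = image.convert('RGB')  # CLIP expects three-channel RGB input
print(image.mode)  # RGB
```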

Summary

CLIP connects images and text by learning their relationship.

You can use it to find or describe images using natural language.

It is easy to use with pre-trained models and processors.
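The image-search use case boils down to ranking stored images by their similarity to a query text. A sketch using random embeddings as stand-ins for the features you would precompute with `model.get_image_features` and `model.get_text_features`:

```python
import torch

torch.manual_seed(0)
# Stand-ins for precomputed CLIP features of 5 images and 1 query text
image_embs = torch.randn(5, 512)
text_emb = torch.randn(1, 512)

# Normalize so dot products are cosine similarities
image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

scores = (image_embs @ text_emb.T).squeeze(1)       # one score per image
ranking = scores.argsort(descending=True).tolist()  # best matches first
print(ranking)
```

In a real search app you would embed the image collection once, store the vectors, and only embed the query text at search time.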