
CLIP (vision-language model) in Computer Vision - Model Pipeline Trace


CLIP is a model that learns to connect pictures and words. It understands images by matching them with text descriptions, helping computers see and read together.

Data Flow - 7 Stages
Stage 1: Input Data
- Input: 10000 samples (images and texts)
- Process: Collect pairs of images and their matching text captions
- Output: 10000 image-text pairs
- Example: Image: photo of a dog; Text: 'a dog playing in the park'
Stage 2: Image Preprocessing
- Input: 10000 images (224x224 pixels, 3 color channels)
- Process: Resize and normalize images for the vision model
- Output: 10000 images (224x224x3, normalized)
- Example: Raw photo resized and pixel values normalized
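The pixel normalization in this stage can be sketched in plain Python. The per-channel means and standard deviations below are the ones published with CLIP's preprocessing; the `normalize_pixel` helper itself is illustrative (a real pipeline would use a library such as torchvision, which also handles the resize to 224x224).

```python
# Channel statistics published with CLIP's image preprocessing.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def normalize_pixel(rgb):
    """Scale 0-255 RGB values to [0, 1], then standardize per channel."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))

pixel = normalize_pixel((124, 116, 104))  # one RGB pixel, for illustration
```

Applied to every pixel of every resized image, this produces the normalized 224x224x3 tensors the vision encoder expects.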
Stage 3: Text Preprocessing
- Input: 10000 text captions (variable length)
- Process: Tokenize and convert words to numbers for the text model
- Output: 10000 token sequences (max length 77 tokens)
- Example: 'a dog playing in the park' -> [12, 45, 78, 34, 9]
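A toy version of this tokenization step, assuming a made-up word-level vocabulary. Real CLIP uses byte-pair encoding, so the IDs here are illustrative only; the fixed length of 77 matches CLIP's text context length.

```python
MAX_LEN = 77  # CLIP's text context length

def tokenize(text, vocab, pad_id=0):
    """Word-level ID lookup plus truncation/padding to MAX_LEN (toy sketch)."""
    ids = [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]
    ids = ids[:MAX_LEN]
    return ids + [pad_id] * (MAX_LEN - len(ids))

# Made-up vocabulary for illustration; real CLIP uses a BPE vocabulary.
vocab = {"<unk>": 1, "a": 12, "dog": 45, "playing": 78, "in": 34, "the": 9, "park": 56}
tokens = tokenize("a dog playing in the park", vocab)
```

Every caption comes out the same length, so a batch of 10000 captions becomes a regular 10000x77 array of token IDs.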
Stage 4: Feature Extraction
- Input: Images (224x224x3), text tokens (max 77)
- Process: Use separate neural networks to get image and text features
- Output: 10000 image features (512 dims), 10000 text features (512 dims)
- Example: Image feature vector: [0.12, 0.45, ..., 0.33]; Text feature vector: [0.11, 0.47, ..., 0.30]
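The real encoders are large neural networks (a vision transformer or ResNet for images, a transformer for text), but the key idea, two separate networks projecting into the same 512-dimensional space, can be shown with a stand-in linear projection. The 8-dimensional "image" input and the random weights below are placeholders, not CLIP's actual architecture.

```python
import random

EMBED_DIM = 512  # the shared embedding size used in this trace
random.seed(0)

def random_projection(in_dim, out_dim):
    """Stand-in 'encoder': one random linear layer (illustrative only)."""
    return [[random.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

def encode(features, weights):
    """Project an input vector into the shared EMBED_DIM space."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

image_encoder = random_projection(8, EMBED_DIM)  # toy 8-dim "image" input
image_feature = encode([0.5] * 8, image_encoder)
```

What matters downstream is only that both encoders emit vectors of the same dimensionality, so image and text features can be compared directly.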
Stage 5: Feature Normalization
- Input: Image and text features (512 dims each)
- Process: Normalize features to have length 1 for cosine similarity
- Output: Normalized image and text features (512 dims each)
- Example: Normalized vector length = 1
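L2 normalization is a one-liner: divide each feature vector by its Euclidean norm. A 2-dimensional example keeps the arithmetic visible.

```python
import math

def l2_normalize(vec):
    """Divide by the Euclidean norm so the vector has length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

unit = l2_normalize([3.0, 4.0])  # norm is 5, so the result is [0.6, 0.8]
```

After this step, the dot product of any image feature with any text feature equals their cosine similarity, which is what the next stage computes.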
Stage 6: Similarity Computation
- Input: Normalized image and text features
- Process: Calculate cosine similarity scores between image and text pairs
- Output: Similarity scores matrix (10000 x 10000)
- Example: Score between image 1 and text 1: 0.85
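With unit-length features, cosine similarity is just a dot product, and computing it for all pairs at once yields the N x N similarity matrix. A sketch with two toy features per side:

```python
def cosine_similarity_matrix(image_feats, text_feats):
    """Pairwise dot products; for unit vectors this is cosine similarity."""
    return [[sum(a * b for a, b in zip(img, txt)) for txt in text_feats]
            for img in image_feats]

# Two toy normalized features per side; the diagonal holds the matching pairs.
image_feats = [[1.0, 0.0], [0.0, 1.0]]
text_feats = [[1.0, 0.0], [0.0, 1.0]]
sim = cosine_similarity_matrix(image_feats, text_feats)
```

In practice this is a single matrix multiplication, and entry (i, j) scores how well image i matches text j.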
Stage 7: Loss Calculation
- Input: Similarity scores matrix
- Process: Compute contrastive loss to bring matching pairs closer and push non-matching pairs apart
- Output: Scalar loss value
- Example: Loss = 0.45
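CLIP's contrastive loss is a symmetric cross-entropy over the similarity matrix: for row i (image-to-text) and column i (text-to-image), the correct class is the matching pair i on the diagonal. A sketch, using 0.07 as the temperature (the value the CLIP paper uses to initialize its learned temperature):

```python
import math

def contrastive_loss(sim, temperature=0.07):
    """Symmetric cross-entropy over the similarity matrix.
    The matching pair for image i is text i, so index i is the target
    class both row-wise (image -> text) and column-wise (text -> image)."""
    logits = [[s / temperature for s in row] for row in sim]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_sum = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))

matched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]])   # diagonal dominates
shuffled = contrastive_loss([[0.0, 1.0], [1.0, 0.0]])  # every pair mismatched
```

A near-diagonal similarity matrix gives a loss near zero, while mismatched pairs are heavily penalized, which is exactly the pressure that pulls matching image-text pairs together during training.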
Training Trace - Epoch by Epoch

Loss
2.5 |****
2.0 |*** 
1.5 |**  
1.0 |*   
0.5 |    
0.0 +----
     1 5 10 15 20 Epochs
Epoch | Loss ↓ | Accuracy ↑ | Observation
------|--------|------------|------------
1     | 2.3    | 0.12       | High loss and low accuracy as the model starts learning
5     | 1.1    | 0.45       | Loss decreasing and accuracy improving steadily
10    | 0.6    | 0.70       | Model learning meaningful image-text relations
15    | 0.4    | 0.82       | Good convergence with high accuracy
20    | 0.3    | 0.88       | Loss low and accuracy high; model well trained
Prediction Trace - 5 Layers
Layer 1: Image Preprocessing
Layer 2: Text Preprocessing
Layer 3: Feature Extraction
Layer 4: Feature Normalization
Layer 5: Similarity Computation
Model Quiz - 3 Questions
Test your understanding
What does the similarity score in CLIP represent?
A. How well the image and text match
B. The size of the input image
C. The number of tokens in the text
D. The training loss value
Key Insight
CLIP learns to connect images and text by training two networks together and comparing their outputs. Normalizing features and using contrastive loss helps the model understand which images and texts belong together.