
CLIP (vision-language model) in Computer Vision - Model Metrics & Evaluation

Which metric matters for CLIP, and why

CLIP matches images and text. The key metric is zero-shot accuracy. It shows how well CLIP picks the right text label for an image without extra training. This tells us if CLIP understands the connection between pictures and words.

Other useful metrics include Recall@K, which checks if the correct label is in the top K guesses. This matters because CLIP often suggests several possible matches.
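Both metrics can be computed directly from a similarity matrix between image and text embeddings. The sketch below uses a tiny hand-made score matrix instead of real CLIP embeddings; in practice the scores would be cosine similarities between CLIP's image and text features.

```python
import numpy as np

# Hypothetical similarity scores: rows = images, columns = text labels.
# Invented values for illustration; real scores come from CLIP embeddings.
sims = np.array([
    [0.9, 0.1, 0.0],    # image 0: true label 0, ranked correctly
    [0.2, 0.7, 0.1],    # image 1: true label 1, ranked correctly
    [0.4, 0.1, 0.35],   # image 2: true label 2, top-1 prediction is wrong
])
true_labels = np.array([0, 1, 2])

# Zero-shot accuracy: does the top-scoring label match the true one?
top1 = sims.argmax(axis=1)
accuracy = (top1 == true_labels).mean()

# Recall@K: is the true label among the K highest-scoring labels?
def recall_at_k(sims, labels, k):
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of the k best scores
    return np.mean([l in row for l, row in zip(labels, topk)])

print(accuracy)                          # 2 of 3 top-1 predictions correct
print(recall_at_k(sims, true_labels, 2)) # all true labels appear in the top 2
```

Note how image 2 is a top-1 miss but a top-2 hit, which is exactly why Recall@K matters when CLIP suggests several plausible matches.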

Confusion matrix or equivalent visualization

CLIP's output is a similarity score between images and text. We can build a confusion matrix by treating the highest scoring text as the prediction for each image.

      |                | Predicted: Cat | Predicted: Dog | Predicted: Car |
      |----------------|----------------|----------------|----------------|
      | Actual: Cat    |       85       |       10       |       5        |
      | Actual: Dog    |       8        |       90       |       2        |
      | Actual: Car    |       3        |       7        |       90       |
    

This shows how often CLIP correctly matches images to their true labels (diagonal numbers) versus mistakes (off-diagonal).
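A minimal sketch of this construction, assuming integer class labels (0 = cat, 1 = dog, 2 = car) and a made-up score matrix:

```python
import numpy as np

# Build a confusion matrix by treating the highest-scoring text label
# as each image's prediction. Rows = actual class, columns = predicted.
def confusion_matrix(true_labels, predictions, n_classes):
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, predictions):
        cm[t, p] += 1
    return cm

# Invented similarity scores for four images.
sims = np.array([
    [0.8, 0.1, 0.1],   # predicted cat (correct)
    [0.3, 0.6, 0.1],   # predicted dog (correct)
    [0.5, 0.3, 0.2],   # predicted cat, but actually a dog
    [0.1, 0.2, 0.7],   # predicted car (correct)
])
true_labels = [0, 1, 1, 2]
preds = sims.argmax(axis=1)
print(confusion_matrix(true_labels, preds, 3))
```

Correct matches accumulate on the diagonal; the misclassified dog lands in the off-diagonal cell (actual dog, predicted cat).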

Precision vs Recall tradeoff with examples

For CLIP, precision asks: when it says an image matches a text, how often is it right? Recall asks: of all the true matches that exist, how many does it find?

Example: If CLIP is used to find images of "dogs" in a big gallery, high recall means it finds most dog images. High precision means most images it finds are really dogs.

If you want to avoid showing wrong images (like cats labeled as dogs), prioritize precision. If you want to find all dog images even if some mistakes happen, prioritize recall.
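The tradeoff becomes concrete when you retrieve images by thresholding similarity scores: raising the threshold trades recall for precision. The scores and thresholds below are invented for illustration.

```python
# Toy sketch: retrieve "dog" images whose similarity score to the text
# query clears a threshold, then measure precision and recall.
def precision_recall(scores, is_dog, threshold):
    retrieved = [d for s, d in zip(scores, is_dog) if s >= threshold]
    if not retrieved:
        return 0.0, 0.0
    precision = sum(retrieved) / len(retrieved)   # fraction retrieved that are dogs
    recall = sum(retrieved) / sum(is_dog)         # fraction of all dogs retrieved
    return precision, recall

scores = [0.9, 0.8, 0.6, 0.5, 0.3]
is_dog = [1,   1,   0,   1,   0]   # ground truth: which images really show dogs

# High threshold: fewer results, higher precision, lower recall.
print(precision_recall(scores, is_dog, 0.7))   # (1.0, 2/3)
# Low threshold: more results, lower precision, higher recall.
print(precision_recall(scores, is_dog, 0.4))   # (0.75, 1.0)
```

Choosing the threshold is how you express the priority from the paragraph above: raise it to avoid showing wrong images, lower it to find every dog.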

What "good" vs "bad" metric values look like for CLIP

Good: Zero-shot accuracy above 70% on standard datasets means CLIP understands image-text links well. Recall@5 above 90% means the right label is almost always in the top 5 guesses.

Bad: Accuracy below 50% means CLIP struggles to match images and text. Low recall means it misses many correct matches, making it unreliable for search or classification.

Common pitfalls in CLIP metrics
  • Accuracy paradox: High accuracy can happen if many images belong to one class, hiding poor performance on others.
  • Data leakage: Using test images or captions seen during training inflates metrics falsely.
  • Overfitting: Fine-tuning CLIP on small datasets can reduce generalization, hurting zero-shot ability.
  • Ignoring top-K metrics: Only checking top-1 accuracy misses how well CLIP ranks relevant labels.
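The accuracy paradox from the first bullet is easy to reproduce numerically. In this invented example, one dominant class makes overall accuracy look strong while recall on a rare class collapses:

```python
# Accuracy paradox sketch: 95 common "car" images, 5 rare "fire truck"
# images. The model finds only 1 of the 5 fire trucks. All numbers invented.
true = ["car"] * 95 + ["fire truck"] * 5
pred = ["car"] * 95 + ["car"] * 4 + ["fire truck"]

accuracy = sum(t == p for t, p in zip(true, pred)) / len(true)
ft_recall = (sum(t == p == "fire truck" for t, p in zip(true, pred))
             / true.count("fire truck"))

print(accuracy)    # 0.96 — looks great overall
print(ft_recall)   # 0.2  — the rare class is mostly missed
```

A single headline accuracy number hides this; always check per-class recall when classes are imbalanced.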
Self-check question

Your CLIP model has 98% accuracy but only 12% recall on a rare class like "fire trucks." Is it good for production? Why or why not?

Answer: No, it is not good. The high accuracy likely comes from many common classes, but the very low recall on "fire trucks" means CLIP misses most fire truck images. For rare but important classes, recall matters more to avoid missing them.

Key Result
Zero-shot accuracy and Recall@K are key metrics showing how well CLIP matches images to text without extra training.