CLIP matches images with text. Its key metric is zero-shot accuracy: the fraction of images for which CLIP picks the correct text label from a set of candidates, without any additional training on that task. A high score indicates that CLIP has learned the connection between pictures and the words that describe them.
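As a rough illustration, zero-shot accuracy could be computed as follows. This is a minimal sketch that assumes the image and label embeddings have already been produced by a CLIP model. The toy vectors and the helper names `cosine_similarity` and `zero_shot_accuracy` are hypothetical and do not belong to any CLIP library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_accuracy(image_embeddings, label_embeddings, true_labels):
    """Fraction of images whose most similar label is the correct one."""
    correct = 0
    for image_vec, true_idx in zip(image_embeddings, true_labels):
        # Score the image against every candidate label.
        scores = [cosine_similarity(image_vec, label_vec)
                  for label_vec in label_embeddings]
        # The prediction is the index of the highest-scoring label.
        predicted_idx = max(range(len(scores)), key=scores.__getitem__)
        if predicted_idx == true_idx:
            correct += 1
    return correct / len(true_labels)


# Toy embeddings (assumed, not real CLIP outputs).
label_embeddings = [[1.0, 0.0], [0.0, 1.0]]  # e.g. "cat", "dog"
image_embeddings = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
true_labels = [0, 1, 1]  # the third image is a "dog" that looks like a "cat"

accuracy = zero_shot_accuracy(image_embeddings, label_embeddings, true_labels)
print(f"zero-shot accuracy: {accuracy:.2f}")  # prints "zero-shot accuracy: 0.67"
```

In this toy case the third image is scored closer to the wrong label, so two of the three predictions are correct.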
Other useful metrics include Recall@K, which checks whether the correct label appears among the K highest-scoring guesses. This matters because CLIP scores every candidate, so in practice it returns a ranked list of possible matches rather than a single answer.
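Recall@K can be sketched in a similar way. The example below assumes a matrix of similarity scores is already available, with one row per image and one column per candidate label. The scores and the function name `recall_at_k` are made up for illustration.

```python
def recall_at_k(scores, true_labels, k):
    """Fraction of images whose correct label is among the top-k scores."""
    hits = 0
    for row, true_idx in zip(scores, true_labels):
        # Indices of the candidate labels, sorted from best to worst score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if true_idx in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Toy similarity scores (assumed): rows are images, columns are labels.
scores = [
    [0.9, 0.3, 0.1],  # correct label 0 is ranked first
    [0.5, 0.4, 0.2],  # correct label 1 is ranked second
    [0.6, 0.3, 0.1],  # correct label 2 is ranked third
]
true_labels = [0, 1, 2]

recall_at_1 = recall_at_k(scores, true_labels, k=1)
recall_at_2 = recall_at_k(scores, true_labels, k=2)
recall_at_3 = recall_at_k(scores, true_labels, k=3)
print(f"R@1={recall_at_1:.2f} R@2={recall_at_2:.2f} R@3={recall_at_3:.2f}")
```

With K = 1 this reduces to zero-shot accuracy, and the value can only stay the same or rise as K grows, which is why Recall@K is usually reported for several values of K.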