For text-to-image models, the key metrics cover two things: how well the generated image matches the prompt, and how realistic the images look. Common metrics include CLIP score, which measures similarity between the text prompt and the image, and FID (Fréchet Inception Distance), which compares the feature statistics of a set of generated images with those of real images and so reflects both image quality and diversity. These metrics matter because they tell us if the prompt leads to images that are both relevant and visually realistic.
Text-to-image prompt crafting in Prompt Engineering / GenAI - Model Metrics & Evaluation
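Text-to-image generation does not use a confusion matrix like classification. Instead, we use similarity scores. A CLIP score is based on the cosine similarity between the prompt embedding and the image embedding. Raw cosine similarity lies between -1 and 1, and many tools clip negative values to 0 and rescale the result, so the exact range depends on the implementation. The examples here assume a score normalized to the range 0 to 1, where higher means a better match between prompt and image.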
Prompt: "A red apple on a wooden table"
Generated image CLIP score: 0.85 (high similarity)

Prompt: "A blue car in the forest"
Generated image CLIP score: 0.45 (low similarity)
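As a rough sketch of what sits behind such a score: CLIP-style metrics embed the prompt and the image into the same vector space and compare the two embeddings with cosine similarity. The vectors below are made-up three-dimensional stand-ins (real embeddings come from the CLIP model and are much longer), and `cosine_similarity` is a helper written for this illustration, not a library call.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for illustration only.
text_emb = [0.9, 0.1, 0.3]         # prompt embedding
matching_image = [0.8, 0.2, 0.4]   # points in nearly the same direction
unrelated_image = [0.1, 0.9, 0.2]  # points in a different direction

good = cosine_similarity(text_emb, matching_image)
poor = cosine_similarity(text_emb, unrelated_image)
print(f"matching image:  {good:.2f}")
print(f"unrelated image: {poor:.2f}")
```

The matching image scores higher than the unrelated one. The raw value can in principle fall anywhere from -1 to 1, so any rescaling to another range is a choice made by the specific metric implementation.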
Precision and recall are not computed directly for text-to-image output, but they work as an analogy: think of precision as how accurately the image matches the prompt details, and recall as how well the image covers all aspects of the prompt.
Example:
- High precision, low recall: The image shows a red apple but misses the wooden table.
- Low precision, high recall: The image has a table and something red, but it is not clearly an apple.
Good prompt crafting aims to balance both, so the image is accurate and complete.
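The analogy can be made concrete with a small sketch. It assumes someone (or some detector) has already listed which elements appear in the image; the element lists below are hand-written for illustration, and `element_precision_recall` is a hypothetical helper, not an established metric.

```python
def element_precision_recall(prompt_elements, image_elements):
    """Precision: share of image elements that the prompt asked for.
    Recall: share of prompt elements that appear in the image."""
    prompt_set = set(prompt_elements)
    image_set = set(image_elements)
    matched = prompt_set & image_set
    precision = len(matched) / len(image_set) if image_set else 0.0
    recall = len(matched) / len(prompt_set) if prompt_set else 0.0
    return precision, recall


# Prompt: "A red apple on a wooden table", split into elements by hand.
prompt = ["red", "apple", "wooden", "table"]

# Image A: a clear red apple, but no table.
image_a = ["red", "apple"]

# Image B: a wooden table and something red, plus unrequested clutter.
image_b = ["red", "wooden", "table", "ball", "plate", "cloth"]

print(element_precision_recall(prompt, image_a))  # high precision, low recall
print(element_precision_recall(prompt, image_b))  # lower precision, higher recall
```

Image A contains only requested elements but covers half of the prompt. Image B covers more of the prompt but adds things nobody asked for.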
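Good: CLIP score above 0.75 (on the normalized 0 to 1 scale used here), FID score low (closer to 0), and images clearly show prompt details.
Bad: CLIP score below 0.5 on the same scale, high FID score, images are blurry, unrelated, or miss key prompt elements.
These thresholds are rules of thumb. CLIP score ranges differ between tools, and FID values are only comparable when computed against the same reference images with a similar number of samples.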
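To see why a lower FID is better: FID is the Fréchet distance between two Gaussians fitted to Inception features of real and generated images, d^2 = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)). In one dimension this reduces to (mu_r - mu_g)^2 + (sd_r - sd_g)^2. The sketch below uses that one-dimensional case on made-up numbers, with the population standard deviation for simplicity, so it only illustrates how the formula behaves and is not a real FID computation.

```python
import statistics


def frechet_distance_1d(real, generated):
    """Squared Fréchet distance between 1-D Gaussians fitted to two samples."""
    mu_r, mu_g = statistics.fmean(real), statistics.fmean(generated)
    sd_r, sd_g = statistics.pstdev(real), statistics.pstdev(generated)
    return (mu_r - mu_g) ** 2 + (sd_r - sd_g) ** 2


# Made-up one-dimensional "features", for illustration only.
real = [1.0, 2.0, 3.0, 4.0, 5.0]
identical = [1.0, 2.0, 3.0, 4.0, 5.0]  # same statistics as the real data
shifted = [3.0, 4.0, 5.0, 6.0, 7.0]    # same spread, mean moved by 2
collapsed = [3.0, 3.0, 3.0, 3.0, 3.0]  # same mean, no variety at all

for name, sample in [("identical", identical),
                     ("shifted", shifted),
                     ("collapsed", collapsed)]:
    print(name, round(frechet_distance_1d(real, sample), 3))
```

The distance is 0 when the statistics match, and it grows both when the generated samples drift away from the real ones (the shifted case) and when they lose spread (the collapsed case).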
- Overfitting: Model may generate images that look good on training prompts but fail on new prompts.
- Data leakage: If test prompts are too similar to training data, metrics may be misleadingly high.
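- Accuracy paradox: A high CLIP score does not always mean good image quality; an image can match the text and still look blurry or unrealistic.
- Ignoring diversity: FID folds realism and variety into a single number, so a reasonable FID alone does not show whether outputs are varied. A model can still produce repetitive images, so check diversity separately, for example by comparing several images generated from the same prompt.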
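For the data leakage pitfall, one simple check is to look for test prompts that nearly duplicate training prompts. The sketch below uses Jaccard similarity over lowercased words. The function names, the example prompts, and the 0.8 threshold are all assumptions made for illustration, and the check only applies when the training prompts are actually available.

```python
def jaccard(prompt_a, prompt_b):
    """Word-level Jaccard similarity between two prompts (case-insensitive)."""
    tokens_a = set(prompt_a.lower().split())
    tokens_b = set(prompt_b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def flag_near_duplicates(train_prompts, test_prompts, threshold=0.8):
    """Return (test, train) pairs whose similarity reaches the threshold.

    The default threshold is an arbitrary illustrative value.
    """
    flagged = []
    for test in test_prompts:
        for train in train_prompts:
            if jaccard(test, train) >= threshold:
                flagged.append((test, train))
                break  # one match is enough to flag this test prompt
    return flagged


# Hypothetical prompt sets.
train_prompts = ["A red apple on a wooden table", "A dog running on the beach"]
test_prompts = ["a red apple on a wooden table", "A blue car in the forest"]

flagged = flag_near_duplicates(train_prompts, test_prompts)
print(flagged)
```

Here the first test prompt is flagged because it differs from a training prompt only in capitalization, while the second shares too few words with either training prompt. Word overlap misses paraphrases, so this is a first filter rather than a complete leakage check.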
Your text-to-image model has a CLIP score of 0.9 but the images are blurry and lack detail. Is this good? Why or why not?
Answer: Not necessarily good. A high CLIP score means the image matches the prompt text, but blurriness and lack of detail show poor image quality. You need to improve image clarity and realism, not just text-image similarity.
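The reasoning in this answer can be sketched as a check that requires both prompt alignment and image quality before calling a result good. The `assess` function is hypothetical, and the numbers passed to it are illustrative: the 0.75 CLIP cutoff assumes a 0 to 1 scale, and the FID cutoff of 30 and measured FID of 80 are arbitrary placeholders, since meaningful FID values depend on the reference dataset.

```python
def assess(clip_score, fid, clip_min, fid_max):
    """Combine an alignment metric and a quality metric into one verdict."""
    aligned = clip_score >= clip_min  # does the image match the prompt?
    realistic = fid <= fid_max        # do the images look like real ones?
    if aligned and realistic:
        return "good"
    if aligned:
        return "matches the prompt, but image quality is poor"
    if realistic:
        return "realistic, but does not match the prompt"
    return "poor on both alignment and quality"


# Illustrative values only: high CLIP score, but a poor (high) FID.
verdict = assess(clip_score=0.9, fid=80.0, clip_min=0.75, fid_max=30.0)
print(verdict)
```

With these placeholder values the verdict is that the images match the prompt but fall short on quality, which mirrors the answer: a strong CLIP score on its own is not enough.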