
Text-to-image prompt crafting in Prompt Engineering / GenAI - Model Metrics & Evaluation

Which metrics matter for text-to-image prompt crafting, and why

For text-to-image models, the key metrics focus on how well the generated image matches the prompt. Common metrics include CLIP score, which measures similarity between the text prompt and the image, and FID (Fréchet Inception Distance), which measures image quality and diversity compared to real images. These metrics matter because they tell us if the prompt leads to images that are both relevant and visually realistic.

Confusion matrix or equivalent visualization

Text-to-image generation has no confusion matrix the way classification does. Instead, we rely on similarity scores. A CLIP score is the cosine similarity between the prompt's text embedding and the image's embedding: higher means a closer match between prompt and image. (Published CLIP scores are sometimes rescaled, so treat the numbers below as illustrative values on a 0-to-1 scale.)

Prompt: "A red apple on a wooden table"
Generated Image CLIP score: 0.85 (high similarity)

Prompt: "A blue car in the forest"
Generated Image CLIP score: 0.45 (low similarity)
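The scores above can be sketched directly: a CLIP score is a cosine similarity between the prompt's text embedding and the image's embedding. A minimal numpy sketch, where the toy vectors stand in for real CLIP encoder outputs:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (higher = closer)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for CLIP text/image encoder outputs.
text_emb = np.array([0.2, 0.9, 0.1])          # "A red apple on a wooden table"
matching_img = np.array([0.25, 0.85, 0.15])   # image close to the prompt
unrelated_img = np.array([0.9, 0.1, 0.4])     # image far from the prompt

print(cosine_similarity(text_emb, matching_img))   # high (close to 1)
print(cosine_similarity(text_emb, unrelated_img))  # low
```

In practice the embeddings come from a pretrained CLIP model (e.g. via the `transformers` or `open_clip` libraries), not from hand-written vectors.
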

Precision vs Recall tradeoff with concrete examples

In text-to-image, think of precision as how accurately the image matches the prompt details, and recall as how well the image covers all aspects of the prompt.

Example:

  • High precision, low recall: The image shows a red apple but misses the wooden table.
  • Low precision, high recall: The image has a table and something red, but it is not clearly an apple.

Good prompt crafting aims to balance both, so the image is accurate and complete.
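This analogy can be made concrete by treating the prompt as a set of required elements and checking which ones the image depicts. A hypothetical sketch; in practice the image's element list would come from a human rater or a captioning model:

```python
def prompt_precision_recall(prompt_elements, image_elements):
    """Set-based analogy for text-to-image evaluation.
    precision = fraction of depicted elements that the prompt asked for
    recall    = fraction of prompt elements that appear in the image"""
    prompt_elements, image_elements = set(prompt_elements), set(image_elements)
    matched = prompt_elements & image_elements
    precision = len(matched) / len(image_elements) if image_elements else 0.0
    recall = len(matched) / len(prompt_elements) if prompt_elements else 0.0
    return precision, recall

prompt = {"red", "apple", "wooden", "table"}

# High precision, low recall: a red apple, but the table is missing.
print(prompt_precision_recall(prompt, {"red", "apple"}))            # (1.0, 0.5)

# Low precision, high recall: everything is there, plus unasked-for clutter.
print(prompt_precision_recall(prompt, prompt | {"car", "banana"}))
```
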

What "good" vs "bad" metric values look like for text-to-image prompt crafting

Good: CLIP score above 0.75 (on the illustrative 0-1 scale used here), low FID (lower is better; 0 means the generated and real image statistics match exactly), and images clearly show prompt details.

Bad: CLIP score below 0.5, high FID score, images are blurry, unrelated, or miss key prompt elements.
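These rules of thumb can be encoded as a simple quality gate. A sketch: the CLIP cutoff follows the text above, while the FID cutoff of 20 is an assumption, since what counts as "low" FID depends on the dataset and feature extractor:

```python
def looks_good(clip_score, fid_score, clip_min=0.75, fid_max=20.0):
    """Rule-of-thumb quality gate. clip_min follows the text above;
    fid_max=20.0 is an assumed cutoff, not a universal standard."""
    return clip_score > clip_min and fid_score < fid_max

print(looks_good(clip_score=0.85, fid_score=12.0))  # True
print(looks_good(clip_score=0.45, fid_score=60.0))  # False
```
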

Metrics pitfalls
  • Overfitting: Model may generate images that look good on training prompts but fail on new prompts.
  • Data leakage: If test prompts are too similar to training data, metrics may be misleadingly high.
  • Accuracy paradox: High CLIP score does not always mean good image quality; images can match text but be unrealistic.
  • Ignoring diversity: per-image metrics like CLIP score cannot detect repetitive outputs; FID helps here, because mode collapse (many near-identical images) shifts the generated feature statistics away from the real ones and raises FID.
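The last pitfall can be seen numerically. Below is a minimal numpy sketch of the FID formula under a simplifying assumption (diagonal covariances; real FID uses full covariance matrices of Inception-v3 features computed over many images):

```python
import numpy as np

def fid_diagonal(mu_real, var_real, mu_gen, var_gen):
    """FID between two Gaussians fitted to image features, simplified
    to diagonal covariances:
        FID = ||mu_r - mu_g||^2 + sum(var_r + var_g - 2*sqrt(var_r*var_g))"""
    mu_real, var_real = np.asarray(mu_real, float), np.asarray(var_real, float)
    mu_gen, var_gen = np.asarray(mu_gen, float), np.asarray(var_gen, float)
    mean_term = np.sum((mu_real - mu_gen) ** 2)
    cov_term = np.sum(var_real + var_gen - 2.0 * np.sqrt(var_real * var_gen))
    return float(mean_term + cov_term)

# Identical feature statistics -> FID of 0 (a perfect match).
print(fid_diagonal([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]))  # 0.0

# Mode collapse shrinks the generated variance, which raises FID
# even though the mean is unchanged.
print(fid_diagonal([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [0.1, 0.1]))
```
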

Self-check question

Your text-to-image model has a CLIP score of 0.9 but the images are blurry and lack detail. Is this good? Why or why not?

Answer: Not necessarily good. A high CLIP score means the image matches the prompt text, but blurriness and lack of detail show poor image quality. You need to improve image clarity and realism, not just text-image similarity.

Key Result
CLIP score measures how well generated images match their prompts; FID measures image quality and diversity. Together they are the core metrics for evaluating text-to-image prompt crafting.