For text-to-image models, the key metrics focus on how well the generated image matches the prompt and how realistic it looks. Common metrics include CLIP score, which measures similarity between the text prompt and the image, and FID (Fréchet Inception Distance), which compares the feature statistics of generated images against real images as a proxy for quality and diversity. These metrics matter because they tell us if the prompt leads to images that are both relevant and visually realistic.
Text-to-image prompt crafting in Prompt Engineering / GenAI - Model Metrics & Evaluation
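Text-to-image generation does not use a confusion matrix like classification, because there are no discrete classes to count. Instead, we use similarity scores. The examples here report CLIP score on a 0-to-1 scale, where higher means a better match between prompt and image. The exact range depends on the implementation: raw CLIP cosine similarity lies between -1 and 1, and some tools rescale it, so compare scores only within one setup.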
Prompt: "A red apple on a wooden table"
Generated Image CLIP score: 0.85 (high similarity)
Prompt: "A blue car in the forest"
Generated Image CLIP score: 0.45 (low similarity)
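At its core, a CLIP-style score is a cosine similarity between a text embedding and an image embedding. The sketch below shows only that last step, using the Python standard library. The three-number vectors are hypothetical toy embeddings made up for illustration; a real CLIP model produces much longer vectors, and tools may rescale the result.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (range -1 to 1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, not output from a real model.
text_embedding = [0.9, 0.1, 0.3]      # stands in for the prompt
matching_image = [0.8, 0.2, 0.4]      # image that fits the prompt
unrelated_image = [0.1, 0.9, -0.2]    # image that does not fit

print(cosine_similarity(text_embedding, matching_image))
print(cosine_similarity(text_embedding, unrelated_image))
```

The matching image points in nearly the same direction as the text embedding, so it scores higher than the unrelated one.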
In text-to-image, think of precision as how accurately the image matches the prompt details, and recall as how well the image covers all aspects of the prompt.
Example:
- High precision, low recall: The image shows a red apple but misses the wooden table.
- Low precision, high recall: The image has a table and something red, but it is not clearly an apple.
Good prompt crafting aims to balance both, so the image is accurate and complete.
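One rough way to make this analogy concrete is to list the elements the prompt asks for and the elements found in the image, then compare the two sets. This is a sketch under assumptions: the function name and the element lists are hypothetical, and in practice the image elements would come from a human reviewer or a detection model. It uses exact matching only, so it does not capture partial matches such as "something red" standing in for an apple.

```python
def element_precision_recall(prompt_elements, image_elements):
    """Return (precision, recall) of image elements against prompt elements."""
    prompt_set = set(prompt_elements)
    image_set = set(image_elements)
    matched = prompt_set & image_set
    # Precision: share of what the image shows that the prompt asked for.
    precision = len(matched) / len(image_set) if image_set else 0.0
    # Recall: share of what the prompt asked for that the image shows.
    recall = len(matched) / len(prompt_set) if prompt_set else 0.0
    return precision, recall

prompt = ["red apple", "wooden table"]

# The image shows the apple but misses the table.
print(element_precision_recall(prompt, ["red apple"]))                  # (1.0, 0.5)
# The image shows both requested elements and nothing else.
print(element_precision_recall(prompt, ["red apple", "wooden table"]))  # (1.0, 1.0)
```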
Good: CLIP score above 0.75, FID score low (closer to 0), and images clearly show prompt details.
Bad: CLIP score below 0.5, high FID score, images are blurry, unrelated, or miss key prompt elements.
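To see why a lower FID is better, it helps to look at a simplified version of the distance. Real FID fits Gaussians to Inception network features and uses full covariance matrices with a matrix square root. The sketch below is not that: it assumes independent feature dimensions (diagonal covariance), where the Fréchet distance reduces to squared gaps in mean and standard deviation per dimension. The function name and the tiny feature lists are hypothetical.

```python
from statistics import fmean, pstdev

def frechet_distance_diagonal(real_features, generated_features):
    """Simplified Fréchet distance assuming independent feature dimensions."""
    distance = 0.0
    # zip(*rows) transposes the data so we iterate one dimension at a time.
    for real_dim, gen_dim in zip(zip(*real_features), zip(*generated_features)):
        mean_gap = fmean(real_dim) - fmean(gen_dim)
        std_gap = pstdev(real_dim) - pstdev(gen_dim)
        distance += mean_gap ** 2 + std_gap ** 2
    return distance

# Hypothetical 2-dimensional features for two images each.
real = [[0.0, 1.0], [2.0, 3.0]]
shifted = [[1.0, 1.0], [3.0, 3.0]]  # first dimension shifted by 1

print(frechet_distance_diagonal(real, real))     # identical statistics
print(frechet_distance_diagonal(real, shifted))  # grows with the shift
```

When the generated features have the same statistics as the real ones the distance is zero, and it grows as the two sets drift apart.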
- Overfitting: Model may generate images that look good on training prompts but fail on new prompts.
- Data leakage: If test prompts are too similar to training data, metrics may be misleadingly high.
- Accuracy paradox: High CLIP score does not always mean good image quality; images can match text but be unrealistic.
- Ignoring diversity: A low FID suggests realistic images, but it is an aggregate score and can still hide repetitive outputs, so check variety directly.
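The data leakage pitfall above can be checked roughly by comparing test prompts against training prompts. The sketch below uses word-level Jaccard similarity, which is one simple choice among many. The function names and the 0.8 threshold are assumptions for illustration, not a standard.

```python
def jaccard_similarity(prompt_a, prompt_b):
    """Word-level Jaccard similarity between two prompts (0 to 1)."""
    words_a = set(prompt_a.lower().split())
    words_b = set(prompt_b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)

def flag_leaky_prompts(test_prompts, train_prompts, threshold=0.8):
    """Return test prompts that closely overlap any training prompt."""
    flagged = []
    for test_prompt in test_prompts:
        if any(jaccard_similarity(test_prompt, t) >= threshold for t in train_prompts):
            flagged.append(test_prompt)
    return flagged

train_prompts = ["A red apple on a wooden table"]
test_prompts = ["a red apple on a wooden table", "A blue car in the forest"]

print(flag_leaky_prompts(test_prompts, train_prompts))
```

Flagged prompts are candidates to remove from the test set before trusting the reported scores.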
Your text-to-image model has a CLIP score of 0.9 but the images are blurry and lack detail. Is this good? Why or why not?
Answer: Not necessarily good. A high CLIP score means the image matches the prompt text, but blurriness and lack of detail show poor image quality. You need to improve image clarity and realism, not just text-image similarity.
Practice
Solution
Step 1: Understand the role of a prompt
A prompt is a description that tells the AI what image to make.
Step 2: Identify the correct purpose
Only "To describe what image you want the AI to create" matches this role by describing the desired image.
Final Answer:
To describe what image you want the AI to create -> Option A
Quick Check:
Prompt = Image description [OK]
- Confusing prompt with coding instructions
- Thinking prompt edits images directly
- Assuming prompt sets colors manually
Solution
Step 1: Identify prompt format
Prompts are plain text descriptions, not code or HTML.
Step 2: Match the correct option
"A sunny beach with palm trees and clear blue water" is a clear text description suitable as a prompt.
Final Answer:
"A sunny beach with palm trees and clear blue water" -> Option B
Quick Check:
Prompt = Plain text description [OK]
- Using code or HTML instead of text
- Confusing prompts with programming functions
- Trying to query images with SQL as prompt
Given the prompt "A red apple on a wooden table, photorealistic style", what kind of image will the AI most likely generate?
Solution
Step 1: Analyze prompt details
The prompt says "photorealistic style" and describes a red apple on a wooden table.
Step 2: Match prompt to image type
The AI will generate a detailed, realistic photo-like image matching the description.
Final Answer:
A detailed, realistic photo of a red apple on wood -> Option C
Quick Check:
Photorealistic prompt = Realistic image [OK]
- Ignoring style words and expecting cartoons
- Confusing text prompts with text images
- Assuming blurry or sketch style without prompt
You used the prompt "A futuristic cityscape at night, neon lights, cyberpunk style" but the AI generated a daytime image without neon colors. What is the likely problem?
Solution
Step 1: Check prompt clarity
The prompt mentions 'night' and 'neon lights' but may not emphasize them enough for the AI.
Step 2: Improve prompt specificity
Adding stronger emphasis or repeating keywords helps the AI focus on night and neon colors.
Final Answer:
The prompt should specify 'night' and 'neon' more clearly -> Option D
Quick Check:
Clear, strong keywords = better AI focus [OK]
- Assuming AI always understands subtle style hints
- Not emphasizing important details enough
- Blaming AI model instead of prompt clarity
Solution
Step 1: Match subject and style
"A cat astronaut on Mars, watercolor painting, soft colors, detailed background" includes the cat astronaut, Mars setting, and watercolor style as requested.
Step 2: Check other options
Options B, C, and D miss key elements like the cat, Mars, or watercolor style.
Final Answer:
"A cat astronaut on Mars, watercolor painting, soft colors, detailed background" -> Option A
Quick Check:
Complete, clear prompt = best image [OK]
- Leaving out main subject or style
- Mixing up animals or settings
- Using vague or unrelated descriptions
