For generative AI, output quality is the central concern. Perplexity measures how well a language model predicts text, indicating whether it has learned the language's patterns. BLEU and ROUGE compare generated text to human-written references, checking whether the output is relevant and on-topic. For images, the Fréchet Inception Distance (FID) measures how close generated images are to real ones. These metrics matter because they tell us whether the AI produces believable, useful content.
What Generative AI actually is in Prompt Engineering / GenAI - Model Metrics & Evaluation
Which metric matters for this concept and WHY
Confusion matrix or equivalent visualization (ASCII)
Generative AI is not evaluated with a confusion matrix the way classifiers are. Instead, we inspect example outputs alongside their scores:
Example: Text generation quality scores
---------------------------------------
Model output: "The cat sat on the mat."
Reference: "The cat is sitting on the mat."
BLEU score: 0.75 (higher is better)
Perplexity: 12.3 (lower is better)
For image generation, FID score example:
FID score: 25.4 (lower means generated images look more like real ones)
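To make the BLEU number above less mysterious, here is a simplified sketch of how BLEU-style overlap is computed. Real BLEU uses clipped n-gram precisions for n = 1..4 plus smoothing; this toy version uses only unigram and bigram precision with the brevity penalty, so its values will not match a full BLEU implementation exactly.

```python
from collections import Counter
import math

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: fraction of candidate n-grams
    that also appear in the reference (counts clipped)."""
    cand = Counter(tuple(candidate[i:i+n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i+n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of n-gram precisions times a brevity penalty."""
    precisions = [ngram_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty: punish candidates shorter than the reference.
    bp = math.exp(min(0.0, 1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

cand = "the cat sat on the mat".split()
ref = "the cat is sitting on the mat".split()
score = simple_bleu(cand, ref)
```

An identical candidate and reference give a score of 1.0; the further the wording drifts from the reference, the lower the score falls toward 0.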
Precision vs Recall (or equivalent tradeoff) with concrete examples
Generative AI tradeoffs differ from the precision/recall tradeoff in classification. Here, we balance creativity against accuracy.
- High creativity, low accuracy: the model invents new ideas but may produce nonsense or factual errors.
- High accuracy, low creativity: the model sticks to known patterns; output is safe and reliable but predictable.
Example: a story generator that invents new plots (creative) vs. one that reproduces training stories almost verbatim (accurate but boring).
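One concrete knob that controls this tradeoff in text generation is the sampling temperature: dividing the model's logits by a temperature before the softmax makes sampling more conservative (low temperature) or more creative (high temperature). A minimal sketch with made-up logits (the token names and values are hypothetical):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then normalize to probabilities.
    Low temperature sharpens the distribution (safe, repetitive);
    high temperature flattens it (creative, riskier)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token logits for ["mat", "sofa", "moon"]
logits = [2.0, 1.0, 0.1]
p_low = softmax_with_temperature(logits, 0.2)    # near-greedy: top token dominates
p_high = softmax_with_temperature(logits, 2.0)   # much flatter: rarer tokens get a chance
```

At temperature 0.2 the top token takes almost all the probability mass; at 2.0 the distribution spreads out, which is exactly the creativity-vs-safety dial in practice.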
What "good" vs "bad" metric values look like for this use case
Good generative AI metrics mean:
- Low perplexity: Model predicts text well, so output is fluent.
- High BLEU/ROUGE: Output matches human examples closely.
- Low FID: Generated images look realistic.
Bad metrics mean:
- High perplexity: Output is confusing or unnatural.
- Low BLEU/ROUGE: Output is irrelevant or off-topic.
- High FID: Images look fake or distorted.
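The perplexity values discussed above come directly from the model's per-token probabilities: perplexity is the exponential of the average negative log-likelihood. A sketch, assuming we already have the probability the model assigned to each actual next token (the probability lists here are made up for illustration):

```python
import math

def perplexity(token_probs):
    """Exponential of the mean negative log-probability per token.
    A model that assigns high probability to each actual next token
    gets low perplexity (fluent); an uncertain model scores high."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

confident = [0.9, 0.8, 0.85, 0.9]   # model rarely surprised -> low perplexity
uncertain = [0.1, 0.05, 0.2, 0.1]   # model often surprised -> high perplexity
```

A model that assigned probability 1.0 to every token would reach the floor of perplexity = 1; real models sit well above that.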
Metrics pitfalls (accuracy paradox, data leakage, overfitting indicators)
- Overfitting: Model memorizes training data, producing perfect but copied outputs, not creative ones.
- Data leakage: if test data appears in the training set, metrics look inflated even though the model is copying rather than generating.
- Metric mismatch: BLEU or ROUGE may not capture creativity or meaning well.
- Perplexity limits: Low perplexity doesn't guarantee interesting or useful output.
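A cheap smoke test for the memorization and leakage pitfalls above is to measure how many long n-grams of a generated sample appear verbatim in the training corpus; very high overlap suggests copying rather than generation. A rough sketch (the corpus and outputs are hypothetical, and real checks would use much larger corpora and efficient indexes):

```python
def ngrams(tokens, n):
    """All length-n token windows in a token list, as a set."""
    return {tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlap(generated, training_texts, n=5):
    """Fraction of length-n token spans in the generated text that
    occur verbatim somewhere in the training corpus."""
    gen = ngrams(generated.split(), n)
    if not gen:
        return 0.0
    train = set()
    for text in training_texts:
        train |= ngrams(text.split(), n)
    return len(gen & train) / len(gen)

training = ["once upon a time there was a brave knight in a castle"]
copied = "once upon a time there was a brave knight"
novel = "a curious robot explored the quiet red canyon at dawn"
```

Here `verbatim_overlap(copied, training)` is 1.0 (every 5-gram is lifted from training), while the novel sentence scores 0.0; an honest generative model should sit much closer to the latter.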
Self-check: Your model has low perplexity but low BLEU score. Is it good?
No. The model predicts text well (low perplexity), but its output does not match human references closely (low BLEU). It may produce fluent yet irrelevant or generic text, so it is not good for tasks that need meaningful, accurate content.
Key Result
Generative AI quality is best judged by metrics like perplexity, BLEU, and FID that measure fluency, relevance, and realism.