Key models overview (GPT, DALL-E, Stable Diffusion) in Prompt Engineering / GenAI - Model Metrics & Evaluation

Which metric matters for this concept and WHY

For models like GPT (text generation), DALL-E and Stable Diffusion (image generation), the key metrics differ because their tasks differ.

GPT: We look at perplexity to see how well the model predicts the next token. Lower perplexity means the model assigns higher probability to the text that actually appears.
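As a minimal sketch (assuming per-token log-probabilities are available from the model), perplexity is just the exponential of the average negative log-likelihood per token:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(average negative log-likelihood per token)."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# If every token is predicted with probability 1/10, the model is
# "as uncertain as choosing among 10 words":
logps = [math.log(1 / 10)] * 5
print(perplexity(logps))  # ~10.0
```

Real evaluations average this over a large held-out corpus rather than a handful of tokens.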

DALL-E and Stable Diffusion: We use FID (Fréchet Inception Distance) and IS (Inception Score) to measure image quality and diversity. Lower FID and higher IS mean better images.

These metrics help us know if the model creates realistic and useful outputs.

Confusion matrix or equivalent visualization (ASCII)

Since these are generative models, confusion matrices don't apply directly. Instead, we use example outputs and metric scores.

    GPT Perplexity Example:
    ----------------------
    Model predicts next word probabilities.
    Perplexity = 10 means on average the model is as uncertain as choosing among 10 words.

    DALL-E / Stable Diffusion FID Example:
    ------------------------------------
    Real images vs generated images feature comparison.
    Lower FID (e.g., 10) means generated images are close to real ones.
    Higher FID (e.g., 100) means poor quality.
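The real FID fits multivariate Gaussians to Inception-network features of real and generated images and measures the Fréchet distance between them. As an illustrative toy (the `fid_1d` helper and the sample distributions below are invented for this sketch), here is the one-dimensional analogue of the same formula:

```python
import numpy as np

def fid_1d(real, fake):
    """One-dimensional Frechet distance between two sample sets:
    squared mean gap plus squared std-dev gap. (Real FID applies the
    multivariate version of this formula to Inception features.)"""
    mu_gap = real.mean() - fake.mean()
    sigma_gap = real.std() - fake.std()
    return mu_gap ** 2 + sigma_gap ** 2

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 10_000)    # stand-in for real-image features
close = rng.normal(0.1, 1.0, 10_000)   # similar distribution -> low "FID"
far = rng.normal(5.0, 3.0, 10_000)     # different distribution -> high "FID"
print(fid_1d(real, close))  # small, near 0
print(fid_1d(real, far))    # much larger
```

The intuition carries over directly: the closer the generated distribution matches the real one in both location and spread, the lower the score.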
    
Precision vs Recall (or equivalent tradeoff) with concrete examples

For generative models, the tradeoff is often between quality and diversity.

Quality: How realistic and sharp the output is.

Diversity: How varied and creative the outputs are.

Example: A model that always generates the same perfect image has high quality but low diversity.

A model that generates many different images but some look blurry has high diversity but lower quality.

Good models balance both well.
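Inception Score rewards exactly this balance: each output should get a confident (sharp) class prediction, and the predictions should vary across outputs. A minimal sketch using synthetic 4-class softmax outputs (the arrays are made up for illustration):

```python
import numpy as np

def inception_score(probs):
    """IS = exp(mean_x KL(p(y|x) || p(y))); probs: (N, classes) softmax rows."""
    p_y = probs.mean(axis=0)                               # marginal class dist.
    kl = (probs * (np.log(probs) - np.log(p_y))).sum(axis=1)
    return np.exp(kl.mean())

# Hypothetical classifier outputs for 4 generated images.
diverse = np.full((4, 4), 0.01)
np.fill_diagonal(diverse, 0.97)    # sharp AND varied classes -> high IS
same = np.tile([0.97, 0.01, 0.01, 0.01], (4, 1))  # sharp but repetitive
print(inception_score(diverse))    # well above 1
print(inception_score(same))       # 1.0 -- no diversity at all
```

A score of 1 is the floor (every image classified identically); higher scores require both sharpness and variety.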

What "good" vs "bad" metric values look like for this use case

GPT:

  • Good perplexity: low (e.g., 10 or less on simple tasks)
  • Bad perplexity: high (e.g., 100 or more means poor text prediction)

DALL-E / Stable Diffusion:

  • Good FID: low (e.g., below 30 means realistic images)
  • Bad FID: high (e.g., above 100 means images look fake or blurry)
  • Good IS: high (e.g., above 8 means diverse and clear images)
  • Bad IS: low (e.g., below 3 means poor image quality or low variety)

Metrics pitfalls (accuracy paradox, data leakage, overfitting indicators)

  • Overfitting: Model memorizes training data, so metrics look great on training but poor on new inputs.
  • Data leakage: If test data leaks into training, metrics falsely improve.
  • Metric mismatch: Using accuracy or classification metrics on generative models is wrong.
  • Ignoring diversity: Only focusing on quality can lead to repetitive outputs.
  • Human evaluation needed: Metrics don't capture creativity or usefulness fully.
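Data leakage is easy to see with a toy "model" that does nothing but memorize its training pairs (the lookup table and test prompts below are invented for illustration):

```python
# Toy model that only memorizes training pairs -- zero generalization.
train = {"the cat sat": "on the mat", "to be or": "not to be"}

def memorizing_model(prompt):
    return train.get(prompt)          # pure lookup

leaked_test = ["the cat sat", "to be or"]   # overlaps with training data
clean_test = ["the dog ran", "ask not"]     # genuinely held out

def hit_rate(prompts):
    return sum(memorizing_model(p) is not None for p in prompts) / len(prompts)

print(hit_rate(leaked_test))  # 1.0 -- leakage makes the model look perfect
print(hit_rate(clean_test))   # 0.0 -- its real generalization ability
```

The same mechanism applies to large models: if test prompts appeared in the training set, the metric measures memorization, not capability.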

Self-check: Your model has 98% accuracy but 12% recall on fraud. Is it good?

This question is about fraud detection, not generative models, but it shows why metrics matter.

98% accuracy sounds good, but 12% recall means the model misses 88% of fraud cases.

This is bad because catching fraud is critical. So, despite high accuracy, the model is not good for production.

Lesson: Always check the right metrics for your task.
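To make the arithmetic concrete, here is one confusion matrix consistent with those numbers (the counts are invented, but they reproduce exactly 98% accuracy and 12% recall):

```python
# Invented counts: 10,000 transactions, 200 frauds (2% prevalence).
tp, fn = 24, 176      # frauds caught vs. frauds missed
tn, fp = 9776, 24     # legit passed vs. legit wrongly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.0%}, recall={recall:.0%}")  # accuracy=98%, recall=12%
```

Because fraud is rare, the huge true-negative count dominates accuracy and hides the 176 missed frauds.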

Key Result
Generative models require task-specific metrics like perplexity for GPT and FID/IS for image models to evaluate quality and diversity effectively.