
GenAI applications (text, image, code, audio) - Model Metrics & Evaluation

Which metric matters for GenAI applications and WHY

Generative AI (GenAI) creates new content like text, images, code, or audio. To check if it works well, we use different metrics depending on the type of content.

For text, we look at perplexity (how well the model predicts words) and BLEU or ROUGE scores (how close generated text is to human examples).
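Both ideas can be sketched in a few lines. This is a minimal illustration, not a production metric: `perplexity` works from per-token log-probabilities you would get from a model, and `unigram_precision` is a simplified stand-in for BLEU-1 (clipped word counts, no brevity penalty).

```python
import math
from collections import Counter

def perplexity(token_log_probs):
    """exp of the negative mean log-probability per token; lower is better."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def unigram_precision(candidate, reference):
    """Clipped unigram overlap: a simplified BLEU-1 without brevity penalty."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum(min(count, ref[word]) for word, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

# A model that assigns every token probability 0.1 has perplexity 10.
print(perplexity([math.log(0.1)] * 4))                           # ≈ 10.0
print(unigram_precision(["the", "cat"], ["the", "cat", "sat"]))  # 1.0
```

Note the clipping via `min(count, ref[word])`: repeating a reference word many times does not inflate the score, which mirrors how BLEU counts matches.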

For images, we use FID (Fréchet Inception Distance) to measure how similar generated images are to real ones.
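FID compares the distribution of generated images to real ones. The sketch below computes the Fréchet distance for the simplified case of diagonal covariances (a convenience assumption; the full formula needs a matrix square root of the covariance product, and in practice the means and variances come from Inception-v3 feature activations, not raw pixels).

```python
import math

def fid_diag(mu1, var1, mu2, var2):
    """Fréchet distance between two Gaussians with diagonal covariance.
    FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2*(S1*S2)^(1/2)); for diagonal
    covariances the trace term reduces to an elementwise sum."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum(v1 + v2 - 2 * math.sqrt(v1 * v2)
                   for v1, v2 in zip(var1, var2))
    return mean_term + cov_term

# Identical distributions give FID 0; shifting the mean raises it.
print(fid_diag([0, 0, 0], [1, 1, 1], [0, 0, 0], [1, 1, 1]))  # 0.0
print(fid_diag([0, 0, 0], [1, 1, 1], [1, 1, 1], [1, 1, 1]))  # 3.0
```

Lower FID means the generated distribution sits closer to the real one, which is why the "good vs bad" thresholds later in this lesson are phrased as "below" values.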

For code, correctness matters most. We check if generated code runs without errors and passes tests.
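A basic pass-rate harness can make this concrete. This is a toy sketch (real evaluation harnesses sandbox the generated code): it executes each snippet, treats syntax or runtime errors as failures, and applies a caller-supplied check.

```python
def pass_rate(snippets, check):
    """Fraction of generated snippets that run without error and pass `check`.
    `check` receives the snippet's namespace and returns True/False."""
    passed = 0
    for code in snippets:
        try:
            namespace = {}
            exec(code, namespace)   # run the generated snippet (unsandboxed!)
            if check(namespace):
                passed += 1
        except Exception:
            pass                    # syntax/runtime errors count as failures
    return passed / len(snippets)

snippets = [
    "def add(a, b): return a + b",   # correct
    "def add(a, b): return a - b",   # runs, but wrong logic
    "def add(a, b) return a + b",    # syntax error
]
print(pass_rate(snippets, lambda ns: ns["add"](2, 3) == 5))  # ≈ 0.333
```

The harness distinguishes "runs but is wrong" from "does not run at all", and both count against the model, which matches how correctness is scored here.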

For audio, we measure quality with Mean Opinion Score (MOS) or signal similarity metrics.

Overall, the right metric depends on the content type and what matters most: quality, accuracy, or similarity to real data.

Confusion matrix or equivalent visualization

For GenAI, traditional confusion matrices don't apply directly because outputs are creative, not just right or wrong.

Instead, here is an example of a code generation evaluation confusion matrix based on test results:

      |                    | Predicted Correct   | Predicted Incorrect |
      |--------------------|---------------------|---------------------|
      | Actually Correct   | True Positive (TP)  | False Negative (FN) |
      | Actually Incorrect | False Positive (FP) | True Negative (TN)  |
    

Where:

  • TP: Generated code passes tests and is correct.
  • FP: Code predicted correct but fails tests.
  • FN: Code predicted incorrect but actually correct.
  • TN: Code predicted incorrect and is incorrect.

This helps calculate precision and recall for code generation quality.
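From those four cells, precision and recall fall out directly. The counts below are made up for illustration:

```python
def precision(tp, fp):
    """Of snippets predicted correct, the fraction that actually pass tests."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of snippets that are actually correct, the fraction predicted correct."""
    return tp / (tp + fn)

# Hypothetical counts from one evaluation run: TP=40, FP=10, FN=20, TN=30
print(precision(40, 10))  # 0.8
print(recall(40, 20))     # ≈ 0.667
```

Notice that TN never appears in either formula: precision and recall only care about the positive (correct-code) class, which is exactly why they expose problems that accuracy hides.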

Precision vs Recall tradeoff with concrete examples

In GenAI, precision and recall tradeoffs depend on the application:

  • Text generation: High precision means generated text is mostly relevant and correct, but may miss some ideas (lower recall). High recall means covering many ideas but risking errors.
  • Image generation: High precision means generated images look very real, but fewer variations. High recall means many diverse images but some may look fake.
  • Code generation: High precision means most generated code works correctly (few bugs). High recall means generating many possible solutions but some may fail.
  • Audio generation: High precision means clear, natural sound but less variety. High recall means many audio styles but some lower quality.

Choosing precision or recall depends on what matters more: avoiding mistakes (precision) or covering many possibilities (recall).

What "good" vs "bad" metric values look like for GenAI

Text: Good models have low perplexity (better word prediction) and high BLEU/ROUGE scores; exact thresholds vary by task, but scores near 0 mean the text barely overlaps human references. Bad models have high perplexity and near-zero scores.

Images: Good models have FID scores below 50 (closer to real images). Bad models have FID above 100 (poor quality).

Code: Good models generate code passing 90%+ of tests. Bad models fail most tests or produce syntax errors.

Audio: Good models score MOS above 4 (natural sound). Bad models score below 2 (robotic or noisy).

Common pitfalls in GenAI metrics
  • Overfitting: Model memorizes training data, so metrics look great but new outputs are poor.
  • Data leakage: Test data accidentally included in training, inflating scores.
  • Accuracy paradox: High accuracy but poor quality outputs (e.g., repetitive text).
  • Ignoring diversity: Good metrics but outputs lack variety, making them boring or predictable.
  • Human evaluation needed: Automated metrics can miss creativity or meaning, so human checks are important.
Self-check question

Your GenAI model for code generation has 98% accuracy but only 12% recall on generating correct code snippets. Is it good for production? Why or why not?

Answer: No, it is not good. The 98% accuracy is misleading: if most snippets in the evaluation set are incorrect, a model that rejects almost everything still scores high accuracy (the accuracy paradox from the pitfalls above). The 12% recall means the model misses 88% of the correct code snippets it should produce, so it fails at its actual job of generating valid solutions. It is not reliable enough for production.
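Made-up counts can show how these two numbers coexist. In this hypothetical run, only 200 of 10,000 snippets are actually correct, so the abundant true negatives inflate accuracy while recall stays tiny:

```python
# Hypothetical evaluation of 10,000 snippets; only 200 are actually correct.
tp, fn = 24, 176        # the model finds just 24 of the 200 correct snippets
fp, tn = 24, 9776       # nearly all incorrect snippets are labeled incorrect

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
recall = tp / (tp + fn)
print(accuracy)  # 0.98
print(recall)    # 0.12
```

The 9,776 true negatives dominate the accuracy numerator, while recall, which ignores them, reveals the model's real weakness.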

Key Result
GenAI evaluation metrics vary by content type; precision and recall tradeoffs depend on application needs.