
GenAI applications (text, image, code, audio) - Model Metrics & Evaluation

Which metric matters for GenAI applications and WHY

Generative AI (GenAI) creates new content like text, images, code, or audio. To check if it works well, we use different metrics depending on the type of content.

For text, we look at perplexity (how well the model predicts the next word; lower is better) and BLEU or ROUGE scores (how much the generated text overlaps with human-written reference text; higher is better).
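To make the two text metrics concrete, here is a minimal sketch in plain Python. It assumes you already have the probability the model assigned to each actual token, and it uses a simplified unigram-overlap score in the spirit of ROUGE-1 recall rather than a full BLEU or ROUGE implementation. The function names and sample sentences are illustrative only.

```python
import math
from collections import Counter

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

def rouge1_recall(candidate, reference):
    """Share of reference words that also appear in the candidate (clipped counts)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(count, cand_counts[word]) for word, count in ref_counts.items())
    return overlap / sum(ref_counts.values())

# A model that gives each correct token probability 0.5 is less "surprised"
# than one that gives each correct token probability 0.1.
confident = perplexity([0.5, 0.5, 0.5, 0.5])
unsure = perplexity([0.1, 0.1, 0.1, 0.1])

# Five of the six reference words are covered by the candidate.
score = rouge1_recall("the cat sat on the mat", "the cat lay on the mat")
print(confident, unsure, score)
```

The confident model lands at a perplexity of about 2 and the unsure one at about 10, which matches the "lower is better" reading.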

For images, we use FID (Fréchet Inception Distance) to measure how similar generated images are to real ones.
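For reference, FID is usually written as the Fréchet distance between two Gaussians fitted to image features. The symbols below are notation chosen here, not taken from the text above: the subscript r stands for features of real images, g for features of generated images, with mean vectors written as mu and covariance matrices as Sigma.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The score is 0 when the two feature distributions have identical means and covariances, and it grows as they drift apart, so lower is better.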

For code, correctness matters most. We check if generated code runs without errors and passes tests.
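A test-based check like this can be sketched in a few lines. The candidate snippets, the function name `add`, and the test cases below are made up for illustration. Note that `exec` runs arbitrary code, so real model output should be executed in a sandbox, which this sketch does not provide.

```python
def passes_tests(source, func_name, test_cases):
    """Return True if the source defines func_name and it passes every test case."""
    namespace = {}
    try:
        exec(source, namespace)  # unsafe for untrusted code; sandbox in practice
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False

# Hypothetical model outputs for the task "write add(a, b)".
candidates = [
    "def add(a, b):\n    return a + b",  # correct
    "def add(a, b):\n    return a - b",  # runs, but gives wrong answers
    "def add(a, b)\n    return a + b",   # syntax error (missing colon)
]
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

results = [passes_tests(src, "add", tests) for src in candidates]
pass_rate = sum(results) / len(results)
print(results, pass_rate)
```

Only the first candidate both runs and passes, so the pass rate here is one in three.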

For audio, we measure quality with Mean Opinion Score (MOS) or signal similarity metrics.
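MOS itself is just an average of listener ratings, as this small sketch shows. It assumes the usual 1 to 5 rating scale, and the ratings are invented for illustration.

```python
from statistics import mean

def mean_opinion_score(ratings):
    """Average listener ratings, each on a 1 (bad) to 5 (excellent) scale."""
    if not ratings or not all(1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be a non-empty list of values from 1 to 5")
    return mean(ratings)

# Hypothetical ratings from six listeners for one generated audio clip.
ratings = [4, 5, 4, 3, 5, 4]
mos = mean_opinion_score(ratings)
print(round(mos, 2))
```

Because MOS depends on human listeners, it is worth reporting how many people rated each clip alongside the score.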

Overall, the right metric depends on the content type and what matters most: quality, accuracy, or similarity to real data.

Confusion matrix or equivalent visualization

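For GenAI, a traditional confusion matrix does not apply directly, because outputs are open-ended rather than simply right or wrong.

It does apply when each output can be checked objectively. For code generation, for example, you can compare a prediction of whether each snippet is correct (say, from a verifier or the model's own confidence) against whether the snippet actually passes the tests: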

      |                    | Predicted Correct   | Predicted Incorrect |
      |--------------------|---------------------|---------------------|
      | Actually Correct   | True Positive (TP)  | False Negative (FN) |
      | Actually Incorrect | False Positive (FP) | True Negative (TN)  |


Where:

  • TP: Code predicted correct that passes the tests.
  • FP: Code predicted correct that fails the tests.
  • FN: Code predicted incorrect that actually passes the tests.
  • TN: Code predicted incorrect that fails the tests.

This helps calculate precision and recall for code generation quality.
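As a quick sketch of that calculation, precision and recall follow directly from the TP, FP, and FN counts. The counts used below are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical evaluation: 40 true positives, 10 false positives, 20 false negatives.
precision, recall = precision_recall(tp=40, fp=10, fn=20)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here 80% of the snippets predicted correct really are correct, while about two thirds of the truly correct snippets were recognized as such.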

Precision vs Recall tradeoff with concrete examples

In GenAI, precision and recall tradeoffs depend on the application:

  • Text generation: High precision means generated text is mostly relevant and correct, but may miss some ideas (lower recall). High recall means covering many ideas but risking errors.
  • Image generation: High precision means generated images look very real, but fewer variations. High recall means many diverse images but some may look fake.
  • Code generation: High precision means most generated code works correctly (few bugs). High recall means generating many possible solutions but some may fail.
  • Audio generation: High precision means clear, natural sound but less variety. High recall means many audio styles but some lower quality.

Choosing precision or recall depends on what matters more: avoiding mistakes (precision) or covering many possibilities (recall).
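One way to see the tradeoff is to imagine that each generated output comes with a confidence score and that you only keep outputs above a threshold. This setup is an assumption made for illustration, as are the scores and labels below. Raising the threshold tends to raise precision and lower recall.

```python
# Hypothetical (confidence score, output is actually correct) pairs.
samples = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.60, False), (0.40, True), (0.30, False), (0.20, False),
]

def precision_recall_at(threshold, samples):
    """Keep outputs scoring at or above the threshold, then measure what was kept."""
    kept = [ok for score, ok in samples if score >= threshold]
    total_correct = sum(ok for _, ok in samples)
    tp = sum(kept)
    precision = tp / len(kept) if kept else 0.0
    recall = tp / total_correct if total_correct else 0.0
    return precision, recall

strict = precision_recall_at(0.85, samples)   # keeps only the top two outputs
lenient = precision_recall_at(0.35, samples)  # keeps six of the eight outputs
print(strict, lenient)
```

The strict threshold keeps only correct outputs but finds just half of them; the lenient one finds all of them but lets two bad outputs through.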

What "good" vs "bad" metric values look like for GenAI

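Text: Good models have low perplexity (better prediction) and, as a rough guide, BLEU/ROUGE scores above 0.5 on a 0 to 1 scale (closer to human text), though typical values vary by task and dataset. Bad models have high perplexity and scores near 0.

Images: As a rough guide, good models have FID scores below 50 (closer to real images) and bad models have FID above 100 (poor quality). Lower is better, and scores are only comparable when computed on the same dataset with the same number of samples.

Code: Good models generate code passing 90%+ of tests. Bad models fail most tests or produce syntax errors.

Audio: MOS is rated on a 1 to 5 scale. Good models score above 4 (natural sound). Bad models score below 2 (robotic or noisy).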

Common pitfalls in GenAI metrics
  • Overfitting: Model memorizes training data, so metrics look great but new outputs are poor.
  • Data leakage: Test data accidentally included in training, inflating scores.
  • Accuracy paradox: A high headline score can hide poor outputs (e.g., high accuracy on an imbalanced test set, or repetitive text that still scores well).
  • Ignoring diversity: Good metrics but outputs lack variety, making them boring or predictable.
  • Human evaluation needed: Automated metrics can miss creativity or meaning, so human checks are important.
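
The diversity pitfall can be checked with a simple measurement. The sketch below uses a distinct-n style ratio (unique n-grams divided by total n-grams), which is one common choice and not something prescribed by the list above. The sample outputs are invented.

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across all outputs (higher = more varied)."""
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# Hypothetical model outputs for three different prompts.
repetitive = ["the cat sat", "the cat sat", "the cat sat"]
varied = ["the cat sat", "a dog ran off", "birds fly south"]

low_diversity = distinct_n(repetitive)
high_diversity = distinct_n(varied)
print(low_diversity, high_diversity)
```

A model that repeats itself scores far lower than one that produces varied outputs, even if both would do equally well on a per-output quality metric.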
Self-check question

Your GenAI model for code generation has 98% accuracy but only 12% recall on generating correct code snippets. Is it good for production? Why or why not?

Answer: No, it is not good. The 98% accuracy sounds high, but accuracy can be inflated when most cases are easy negatives. The 12% recall shows the model misses most of the correct code snippets, so it fails to produce many valid solutions, which is critical for usefulness. The model is not reliable enough for production.
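
To see how 98% accuracy and 12% recall can coexist, here is a worked example with hypothetical counts chosen so the arithmetic comes out exactly. Only 25 of the 1,100 cases are positives, so getting the negatives right dominates the accuracy figure.

```python
# Hypothetical confusion-matrix counts for 1,100 evaluated snippets.
tp, fn = 3, 22      # 25 truly correct snippets, only 3 identified
fp, tn = 0, 1075    # 1,075 truly incorrect snippets, all identified

total = tp + fn + fp + tn
accuracy = (tp + tn) / total  # share of all cases judged correctly
recall = tp / (tp + fn)       # share of truly correct snippets that were found

print(f"accuracy={accuracy:.2%} recall={recall:.2%}")
```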

Key Result
GenAI evaluation metrics vary by content type; precision and recall tradeoffs depend on application needs.

Practice

1. Which of the following is NOT a common application of GenAI?
easy
A. Manually coding software without AI help
B. Creating images from simple descriptions
C. Automatically generating text like stories or emails
D. Producing audio like music or speech

Solution

  1. Step 1: Understand GenAI applications

    GenAI is used to create text, images, code, and audio automatically from prompts.
  2. Step 2: Identify the option that does not involve AI

    Manual coding without AI help is not an application of GenAI.
  3. Final Answer:

    Manually coding software without AI help -> Option A
  4. Quick Check:

    GenAI applications exclude manual tasks = A [OK]
Hint: Look for the option that does not involve AI generation [OK]
Common Mistakes:
  • Mistaking manual tasks for AI applications
  • Thinking all coding is GenAI
  • Ignoring audio as a GenAI output
2. Which of these is the correct way to prompt a GenAI model to generate an image?
easy
A. Write code to manually draw the image pixel by pixel
B. Upload a photo and ask the model to delete it
C. Type 'Generate a photo of a sunset over mountains' as input
D. Ask the model to write a poem about sunsets

Solution

  1. Step 1: Understand how to prompt GenAI for images

    You give a text description like 'Generate a photo of a sunset over mountains' to get an image.
  2. Step 2: Identify the correct prompt among options

    Type 'Generate a photo of a sunset over mountains' as input is a clear text prompt for image generation; others are unrelated or incorrect.
  3. Final Answer:

    Type 'Generate a photo of a sunset over mountains' as input -> Option C
  4. Quick Check:

    Text prompt for image generation = C [OK]
Hint: Choose the option with a clear text description for image generation [OK]
Common Mistakes:
  • Confusing manual drawing with AI generation
  • Treating a photo upload as a generation prompt
  • Mixing text generation with image generation
3. Given this Python code using a GenAI text model:
prompt = "Write a short poem about spring"
response = genai_model.generate(prompt)
print(response)
What is the most likely output?
medium
A. SyntaxError: invalid syntax
B. "Spring blooms bright, with colors anew, Nature wakes up, fresh morning dew."
C. A blank line with no output
D. An image file of flowers

Solution

  1. Step 1: Understand the code's purpose

    The code sends a prompt to a GenAI text model to generate a poem about spring.
  2. Step 2: Predict the output type

    The model returns a text poem, so the printed output is a short poem about spring.
  3. Final Answer:

    "Spring blooms bright, with colors anew, Nature wakes up, fresh morning dew." -> Option B
  4. Quick Check:

    GenAI text generation outputs text poem = B [OK]
Hint: GenAI text prompts return text, not errors or images [OK]
Common Mistakes:
  • Expecting code errors from correct syntax
  • Confusing text output with image output
  • Assuming no output from model call
4. You try to generate audio with this code snippet:
audio = genai_model.generate_audio(prompt="Play a relaxing tune")
print(audio)
But you get an error: AttributeError: 'GenAIModel' object has no attribute 'generate_audio'. What is the likely fix?
medium
A. Use the correct method name, like generate(), for audio generation
B. Change the prompt to text instead of audio
C. Restart the computer to fix the error
D. Remove the print statement

Solution

  1. Step 1: Analyze the error message

    The error says the model object has no method named 'generate_audio'.
  2. Step 2: Correct the method call

    Use the existing method like 'generate()' that supports audio generation via prompt.
  3. Final Answer:

    Use the correct method name, like generate(), for audio generation -> Option A
  4. Quick Check:

    Fix method name to existing one = A [OK]
Hint: Check method names carefully in error messages [OK]
Common Mistakes:
  • Ignoring error details
  • Changing prompt instead of method
  • Restarting without debugging code
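
The debugging step in question 4 can be sketched as follows. `GenAIModel` here is a hypothetical stand-in class, and its `modality` parameter is invented for illustration; a real library will have its own method names, so check its documentation.

```python
class GenAIModel:
    """Hypothetical stand-in for a model client that exposes only generate()."""

    def generate(self, prompt, modality="text"):
        return f"<{modality} output for: {prompt}>"

genai_model = GenAIModel()

try:
    audio = genai_model.generate_audio(prompt="Play a relaxing tune")
except AttributeError:
    # List the public methods the object really has before guessing at a fix.
    public_methods = [name for name in dir(genai_model) if not name.startswith("_")]
    print("Available methods:", public_methods)
    # Call the method that exists instead of the one that does not.
    audio = genai_model.generate("Play a relaxing tune", modality="audio")

print(audio)
```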
5. You want to build a GenAI app that takes a user's text prompt and returns both an image and a short audio description. Which approach best combines these tasks?
hard
A. Use one GenAI model that supports multi-modal outputs for text, image, and audio
B. Ask users to upload images and audio instead of generating them
C. Generate only text and convert it manually to image and audio later
D. Use separate GenAI models: one for text-to-image, another for text-to-audio, then combine results

Solution

  1. Step 1: Understand multi-modal generation needs

    Generating both image and audio from text usually requires specialized models for each type.
  2. Step 2: Choose best practical approach

    Using separate models for text-to-image and text-to-audio then combining outputs is common and effective.
  3. Final Answer:

    Use separate GenAI models: one for text-to-image, another for text-to-audio, then combine results -> Option D
  4. Quick Check:

    Separate models for different media = D [OK]
Hint: Combine specialized models for different media types [OK]
Common Mistakes:
  • Assuming one model handles all media perfectly
  • Ignoring need to combine outputs
  • Asking users to upload instead of generating
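
The approach in question 5 might be wired together roughly like this. Both model functions are hypothetical stubs that return placeholder bytes; in a real app they would call a text-to-image model and a text-to-audio model.

```python
from dataclasses import dataclass

@dataclass
class MultimodalResult:
    prompt: str
    image: bytes
    audio: bytes

def text_to_image(prompt: str) -> bytes:
    """Hypothetical stub standing in for a text-to-image model call."""
    return f"IMAGE[{prompt}]".encode()

def text_to_audio(prompt: str) -> bytes:
    """Hypothetical stub standing in for a text-to-audio model call."""
    return f"AUDIO[{prompt}]".encode()

def generate_image_and_audio(prompt: str) -> MultimodalResult:
    """Send the same user prompt to each specialized model, then combine the results."""
    image = text_to_image(prompt)
    audio = text_to_audio(f"Short spoken description of: {prompt}")
    return MultimodalResult(prompt=prompt, image=image, audio=audio)

result = generate_image_and_audio("a sunset over mountains")
print(result.image, result.audio)
```

Keeping each model behind its own function makes it easy to swap one out without touching the other.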