
Question answering in Prompt Engineering / GenAI - Model Metrics & Evaluation

Which Metric Matters for Question Answering, and Why

For question answering, the goal is to get the correct answer from the model. The two standard metrics for this are Exact Match (EM) and the F1 score.

Exact Match checks whether the model's answer matches the reference answer exactly (usually after light normalization). It is strict but unambiguous.

The F1 score measures token overlap between the model's answer and the reference answer. It balances precision (what fraction of the predicted tokens appear in the reference) and recall (what fraction of the reference tokens the model produced).

These metrics matter because answers vary in length and phrasing, and sometimes the model's answer is close but not exact. F1 credits that partial correctness, while EM does not.
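The two metrics can be sketched in a few lines. This is a minimal illustration, not an official evaluation script; real benchmarks (e.g. SQuAD) also normalize punctuation and articles before comparing:

```python
from collections import Counter

def exact_match(prediction: str, gold: str) -> bool:
    """Strict comparison: the answer must match the reference exactly."""
    return prediction.strip() == gold.strip()

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1: balances precision and recall over answer words."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # per-token overlap
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # correct tokens / predicted tokens
    recall = overlap / len(gold_tokens)     # correct tokens / reference tokens
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "Paris"))           # True
print(token_f1("the city of Paris", "Paris"))  # 0.4: partial credit, EM would be 0
```

Note how "the city of Paris" scores 0 on EM but 0.4 on F1: recall is perfect (the gold token is present) while precision is low (three extra tokens).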

Confusion Matrix or Equivalent Visualization

In question answering, we don't always use a classic confusion matrix because answers are free text, not discrete classes. But we can frame it like this:

    +----------------+-----------------+
    |                |  Model Answer   |
    |                | Correct | Wrong |
    +----------------+---------+-------+
    | True Answer    |   TP    |  FN   |
    | Not Relevant   |   FP    |  TN   |
    +----------------+---------+-------+
    

Here, TP means the model gave the correct answer when one existed, FN means it missed or got wrong an answer it should have found, FP means it answered when it should have abstained, and TN means it correctly gave no answer when none was needed.
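From those four counts, precision and recall fall out directly. The numbers below are made up purely for illustration:

```python
# Hypothetical evaluation counts mapped onto the table above.
tp, fp, fn, tn = 70, 10, 20, 5  # invented numbers for illustration

precision = tp / (tp + fp)  # of the answers the model gave, how many were correct
recall = tp / (tp + fn)     # of the correct answers available, how many it found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With these counts the model is more precise (0.875) than it is complete (about 0.778), so its F1 sits between the two.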

Precision vs Recall Tradeoff with Examples

Precision means when the model answers, how often is it right?

Recall means how many of all correct answers did the model find?

Example: If a model answers only when very sure, it has high precision but might miss many questions (low recall).

Example: If a model tries to answer every question, it might get more correct (high recall) but also more wrong answers (low precision).

For question answering, a good balance is important. Too many wrong answers confuse users, but missing answers can be frustrating.
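One common way to move along this tradeoff is a confidence threshold: the model only answers when its confidence is high enough. A small sketch with invented confidence scores shows the effect:

```python
# Hypothetical (confidence, was_correct) pairs for eight questions.
predictions = [
    (0.95, True), (0.90, True), (0.85, True), (0.80, False),
    (0.60, True), (0.55, False), (0.40, True), (0.30, False),
]
total_correct_available = sum(1 for _, ok in predictions if ok)  # 5

def precision_recall(threshold: float) -> tuple[float, float]:
    """Answer only when confidence >= threshold; score the rest as missed."""
    answered = [ok for conf, ok in predictions if conf >= threshold]
    correct = sum(answered)
    precision = correct / len(answered) if answered else 1.0
    recall = correct / total_correct_available
    return precision, recall

for t in (0.9, 0.5):
    p, r = precision_recall(t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

Raising the threshold from 0.5 to 0.9 lifts precision from about 0.67 to 1.00, but recall drops from 0.80 to 0.40: exactly the cautious-vs-eager behavior described above.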

What Good vs Bad Metric Values Look Like

Good: Exact Match above 80% and F1 score above 85% means the model answers correctly most of the time and usually captures most of the reference answer.

Bad: Exact Match below 50% and F1 below 60% means the model often gives wrong or incomplete answers.

High F1 but low Exact Match means answers are close but not exact, which may be acceptable depending on the use case.

Common Pitfalls in Metrics
  • Ignoring partial correctness: Relying only on Exact Match penalizes answers that are mostly right.
  • Data leakage: If test questions appear in training data, metrics look inflated but the model has memorized rather than generalized.
  • Overfitting: The model performs well on training questions but poorly on new ones.
  • Ignoring answer variations: Different but equally correct answers ("NYC" vs. "New York City") can lower Exact Match unfairly.
Self Check

Your question answering model has 98% accuracy but only 12% recall on hard questions. Is it good for production?

Answer: No. High overall accuracy here likely means the test set is dominated by easy questions. The 12% recall means the model misses almost all hard questions, which is unacceptable if those are the ones users care about.
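A quick invented dataset shows how both numbers can be true at once when easy questions dominate:

```python
# Hypothetical split mirroring the self-check: mostly easy questions.
easy_total, easy_correct = 975, 975   # model aces every easy question
hard_total, hard_correct = 25, 3      # but answers only 12% of hard ones

accuracy = (easy_correct + hard_correct) / (easy_total + hard_total)
hard_recall = hard_correct / hard_total

print(f"overall accuracy={accuracy:.1%}, hard-question recall={hard_recall:.0%}")
```

Overall accuracy lands near 98% even though the model fails 22 of 25 hard questions, which is why a single aggregate metric can hide production-breaking weaknesses.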

Key Result
Exact Match and F1 score are key metrics; they measure how exactly and how well the model answers questions.