
Question answering in Prompt Engineering / GenAI - Model Metrics & Evaluation

Which Metric Matters for Question Answering, and Why

For question answering, the goal is to get the correct answer from the model. The two standard metrics for this are Exact Match (EM) and the F1 score.

Exact Match checks whether the model's answer matches the reference answer exactly (usually after light normalization). It is strict but unambiguous.

The F1 score measures token overlap between the model's answer and the reference answer. It balances precision (what fraction of the predicted tokens appear in the reference) and recall (what fraction of the reference tokens the model produced).

These metrics matter because answers vary in length and phrasing, and sometimes the model's answer is close but not exact. F1 credits that partial correctness, while EM does not.
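The two metrics can be sketched in a few lines. This is a minimal illustration, not an official evaluation script; real benchmarks (e.g. SQuAD) also normalize punctuation and articles before comparing:

```python
from collections import Counter

def exact_match(prediction: str, gold: str) -> bool:
    """Strict comparison: the answer must match the reference exactly."""
    return prediction.strip() == gold.strip()

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1: balances precision and recall over answer words."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # per-token overlap
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # correct tokens / predicted tokens
    recall = overlap / len(gold_tokens)     # correct tokens / reference tokens
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "Paris"))           # True
print(token_f1("the city of Paris", "Paris"))  # 0.4: partial credit, EM would be 0
```

Note how "the city of Paris" scores 0 on EM but 0.4 on F1: recall is perfect (the gold token is present) while precision is low (three extra tokens).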

Confusion Matrix or Equivalent Visualization

In question answering, we don't always use a classic confusion matrix because answers are free text, not discrete classes. But we can frame it like this:

    +----------------+-----------------+
    |                |  Model Answer   |
    |                | Correct | Wrong |
    +----------------+---------+-------+
    | True Answer    |   TP    |  FN   |
    | Not Relevant   |   FP    |  TN   |
    +----------------+---------+-------+
    

Here, TP means the model gave the correct answer when one existed, FN means it missed or got wrong an answer it should have found, FP means it answered when it should have abstained, and TN means it correctly gave no answer when none was needed.
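From those four counts, precision and recall fall out directly. The numbers below are made up purely for illustration:

```python
# Hypothetical evaluation counts mapped onto the table above.
tp, fp, fn, tn = 70, 10, 20, 5  # invented numbers for illustration

precision = tp / (tp + fp)  # of the answers the model gave, how many were correct
recall = tp / (tp + fn)     # of the correct answers available, how many it found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With these counts the model is more precise (0.875) than it is complete (about 0.778), so its F1 sits between the two.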

Precision vs Recall Tradeoff with Examples

Precision means when the model answers, how often is it right?

Recall means how many of all correct answers did the model find?

Example: If a model answers only when very sure, it has high precision but might miss many questions (low recall).

Example: If a model tries to answer every question, it might get more correct (high recall) but also more wrong answers (low precision).

For question answering, a good balance is important. Too many wrong answers confuse users, but missing answers can be frustrating.
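One common way to move along this tradeoff is a confidence threshold: the model only answers when its confidence is high enough. A small sketch with invented confidence scores shows the effect:

```python
# Hypothetical (confidence, was_correct) pairs for eight questions.
predictions = [
    (0.95, True), (0.90, True), (0.85, True), (0.80, False),
    (0.60, True), (0.55, False), (0.40, True), (0.30, False),
]
total_correct_available = sum(1 for _, ok in predictions if ok)  # 5

def precision_recall(threshold: float) -> tuple[float, float]:
    """Answer only when confidence >= threshold; score the rest as missed."""
    answered = [ok for conf, ok in predictions if conf >= threshold]
    correct = sum(answered)
    precision = correct / len(answered) if answered else 1.0
    recall = correct / total_correct_available
    return precision, recall

for t in (0.9, 0.5):
    p, r = precision_recall(t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

Raising the threshold from 0.5 to 0.9 lifts precision from about 0.67 to 1.00, but recall drops from 0.80 to 0.40: exactly the cautious-vs-eager behavior described above.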

What Good vs Bad Metric Values Look Like

Good: Exact Match above 80% and F1 score above 85% means the model answers correctly most of the time and usually captures most of the reference answer.

Bad: Exact Match below 50% and F1 below 60% means the model often gives wrong or incomplete answers.

High F1 but low Exact Match means answers are close but not exact, which may be acceptable depending on the use case.

Common Pitfalls in Metrics
  • Ignoring partial correctness: Relying only on Exact Match penalizes answers that are mostly right.
  • Data leakage: If test questions appear in training data, metrics look inflated but the model has memorized rather than generalized.
  • Overfitting: The model performs well on training questions but poorly on new ones.
  • Ignoring answer variations: Different but equally correct answers ("NYC" vs. "New York City") can lower Exact Match unfairly.
Self Check

Your question answering model has 98% accuracy but only 12% recall on hard questions. Is it good for production?

Answer: No. High overall accuracy here likely means the test set is dominated by easy questions. The 12% recall means the model misses almost all hard questions, which is unacceptable if those are the ones users care about.
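A quick invented dataset shows how both numbers can be true at once when easy questions dominate:

```python
# Hypothetical split mirroring the self-check: mostly easy questions.
easy_total, easy_correct = 975, 975   # model aces every easy question
hard_total, hard_correct = 25, 3      # but answers only 12% of hard ones

accuracy = (easy_correct + hard_correct) / (easy_total + hard_total)
hard_recall = hard_correct / hard_total

print(f"overall accuracy={accuracy:.1%}, hard-question recall={hard_recall:.0%}")
```

Overall accuracy lands near 98% even though the model fails 22 of 25 hard questions, which is why a single aggregate metric can hide production-breaking weaknesses.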

Key Result
Exact Match and F1 score are key metrics; they measure how exactly and how well the model answers questions.