Question answering in Prompt Engineering / GenAI - Model Metrics & Evaluation

Metrics & Evaluation - Question answering
Which metric matters for Question Answering and WHY

For question answering, the main goal is to get the correct answer from the model. We often use Exact Match (EM) and F1 score to check how well the model answers.

Exact Match measures whether the model's answer matches the correct answer exactly, usually after light normalization such as lowercasing. It is strict but easy to interpret.

F1 score looks at the overlap between the words in the model's answer and the correct answer. It balances precision (the share of words in the model's answer that also appear in the correct answer) and recall (the share of words in the correct answer that the model's answer contains).

These metrics matter because answers can be short or long, and sometimes the model's answer is close but not exact. F1 helps measure partial correctness.
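
To make the two metrics concrete, here is a minimal sketch of how EM and word-level F1 can be computed for a single prediction. The normalization steps (lowercasing, dropping punctuation and the articles a/an/the) follow the common SQuAD-style convention and are an assumption here, as are the function names normalize, exact_match, and f1_score.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, split into words."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction, reference):
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    """Harmonic mean of word-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred == ref)
    # Counter intersection counts each shared word at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("in Paris, France", "Paris"))
print(f1_score("in Paris, France", "Paris"))
```

With these definitions, the prediction "in Paris, France" against the correct answer "Paris" gets EM 0 but F1 0.5: one shared word, precision 1/3, recall 1. This is the partial credit that EM alone cannot give.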

Confusion Matrix or Equivalent Visualization

In question answering, we don't always use a classic confusion matrix because answers are text, not just yes/no. But we can think like this:

    +------------------+-----------------------+
    |                  |         Model         |
    |                  |  Answers  | Abstains  |
    +------------------+-----------+-----------+
    | Answer exists    |    TP     |    FN     |
    | No answer exists |    FP     |    TN     |
    +------------------+-----------+-----------+

Here, TP means the model gave a correct answer, FN means it missed the correct answer, FP means it gave a wrong answer, and TN means no answer when none was needed.
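
The four outcomes can be counted with a few lines of code. This is only a sketch of one possible mapping: it assumes that None marks both "no answer exists" in the data and "the model declined to answer", and it counts any wrong answer as FP, following the description above. The function name classify and the example data are made up for illustration.

```python
from collections import Counter


def classify(prediction, reference):
    """Map one (model answer, correct answer) pair to TP, FN, FP, or TN.

    None means "no answer": the model abstained, or no answer exists.
    """
    if prediction is None:
        return "TN" if reference is None else "FN"
    if reference is not None and prediction.strip().lower() == reference.strip().lower():
        return "TP"
    # The model answered, but wrongly (or when no answer was needed).
    return "FP"


# Hypothetical (model answer, correct answer) pairs.
examples = [
    ("Paris", "Paris"),  # correct answer
    (None, "Berlin"),    # missed an answerable question
    ("Rome", "Madrid"),  # wrong answer
    ("Oslo", None),      # answered although no answer exists
    (None, None),        # correctly gave no answer
]

counts = Counter(classify(p, r) for p, r in examples)
print(dict(counts))
```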

Precision vs Recall Tradeoff with Examples

Precision means when the model answers, how often is it right?

Recall means how many of all correct answers did the model find?

Example: If a model answers only when very sure, it has high precision but might miss many questions (low recall).

Example: If a model tries to answer every question, it might get more correct (high recall) but also more wrong answers (low precision).

For question answering, a good balance is important. Too many wrong answers confuse users, but missing answers can be frustrating.
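
The tradeoff can be seen by changing how confident the model must be before it answers. The sketch below assumes each question comes with a confidence score and a correct/incorrect label, and that every question has an answer, so recall is the share of all questions answered correctly. The data and the function name precision_recall are hypothetical.

```python
def precision_recall(results, threshold):
    """results: list of (confidence, is_correct) pairs, one per question.

    The model only answers when its confidence reaches the threshold.
    """
    answered = [ok for conf, ok in results if conf >= threshold]
    correct = sum(answered)
    precision = correct / len(answered) if answered else 0.0
    recall = correct / len(results)
    return precision, recall


# Made-up confidence scores and correctness labels for six questions.
results = [
    (0.95, True), (0.90, True), (0.70, False),
    (0.60, True), (0.40, False), (0.30, True),
]

for threshold in (0.8, 0.0):
    p, r = precision_recall(results, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this made-up data, a threshold of 0.8 gives precision 1.00 with recall 0.33, while answering every question (threshold 0.0) gives precision 0.67 and recall 0.67.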

What Good vs Bad Metric Values Look Like

Good: Exact Match above 80% and F1 score above 85% suggest the model answers correctly most of the time and with good detail. Treat these numbers as rough guides, since achievable scores depend on the dataset and task.

Bad: Exact Match below 50% and F1 below 60% means the model often gives wrong or incomplete answers.

High F1 but low Exact Match means answers are close but not exact, which might be okay depending on use.

Common Pitfalls in Metrics
  • Ignoring partial correctness: Only using Exact Match misses answers that are mostly right.
  • Data leakage: If test questions appear in the training data, metrics look better but do not reflect how the model handles unseen questions.
  • Overfitting: The model performs well on training questions but poorly on new ones.
  • Ignoring answer variations: Different but correct answers can lower Exact Match unfairly.
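
One common way to soften the answer-variation problem is to keep several acceptable answers per question and count a match against any of them. The sketch below assumes a simple normalization (lowercase, no punctuation); the helper names simple_normalize and exact_match_any are hypothetical.

```python
import string


def simple_normalize(text):
    """Lowercase, trim, and drop punctuation so small variations still match."""
    text = text.lower().strip()
    return "".join(ch for ch in text if ch not in string.punctuation)


def exact_match_any(prediction, references):
    """True if the prediction matches at least one acceptable answer."""
    pred = simple_normalize(prediction)
    return any(pred == simple_normalize(ref) for ref in references)


print(exact_match_any("New York City", ["NYC", "New York City", "New York"]))
print(exact_match_any("U.S.A.", ["USA"]))
print(exact_match_any("Boston", ["NYC", "New York City"]))
```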
Self Check

Your question answering model has 98% accuracy but only 12% recall on hard questions. Is it good for production?

Answer: No. High overall accuracy can come mostly from easy questions, which often make up most of the test set. Low recall on hard questions means the model misses most of them, which is bad if those questions are important.
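
A small calculation shows how both numbers can be true at once. The counts below are invented for illustration, and "recall on hard questions" is taken here to mean the share of hard questions answered correctly.

```python
# Hypothetical test set: mostly easy questions, a few hard ones.
easy_total, easy_correct = 1975, 1957
hard_total, hard_correct = 25, 3

overall_accuracy = (easy_correct + hard_correct) / (easy_total + hard_total)
hard_recall = hard_correct / hard_total

print(f"overall accuracy: {overall_accuracy:.2%}")
print(f"recall on hard questions: {hard_recall:.2%}")
```

Because hard questions are only a small part of this test set, the overall accuracy stays at 98% even though the model gets just 12% of them right.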

Key Result
Exact Match and F1 score are the key metrics: Exact Match measures whether an answer matches the correct answer exactly, and F1 measures how much the two overlap.

Practice

1. What is the main purpose of question answering in AI?
easy
A. To find answers from given text or context
B. To generate random text without context
C. To translate languages automatically
D. To create images from descriptions

Solution

  1. Step 1: Understand the goal of question answering

    Question answering systems are designed to find specific answers from a given text or context.
  2. Step 2: Compare options with the goal

    Only option A, "To find answers from given text or context", matches this purpose. The other options describe random text generation, translation, and image generation.
  3. Final Answer:

    To find answers from given text or context -> Option A
  4. Quick Check:

    Question answering = find answers [OK]
Hint: Focus on 'answer from text' meaning [OK]
Common Mistakes:
  • Confusing question answering with translation
  • Thinking it generates random text
  • Mixing it with image generation
2. Which input is essential for a question answering model to work?
easy
A. Only a context without a question
B. Only a question without any context
C. A question and a related context or passage
D. Random text unrelated to the question

Solution

  1. Step 1: Identify inputs needed for question answering

    Question answering requires both a question and some context to find the answer.
  2. Step 2: Match options with required inputs

    Only option C, "A question and a related context or passage", provides both the question and a related context, which is what the model needs.
  3. Final Answer:

    A question and a related context or passage -> Option C
  4. Quick Check:

    Question + context = answer [OK]
Hint: Remember: question needs context to answer [OK]
Common Mistakes:
  • Assuming question alone is enough
  • Ignoring the need for context
  • Choosing unrelated text as input
3. Given this Python code using a question answering model:
from transformers import pipeline
qa = pipeline('question-answering')
context = "The Eiffel Tower is in Paris."
question = "Where is the Eiffel Tower located?"
result = qa(question=question, context=context)
print(result['answer'])
What will be printed?
medium
A. Location unknown
B. Eiffel Tower
C. The Eiffel Tower is in Paris
D. Paris

Solution

  1. Step 1: Understand the code's purpose

    The code uses a question answering pipeline to find the answer to the question from the context.
  2. Step 2: Identify the answer in the context

    The question asks for location; the context says "The Eiffel Tower is in Paris." So the answer is "Paris".
  3. Final Answer:

    Paris -> Option D
  4. Quick Check:

    Answer extracted = Paris [OK]
Hint: Look for direct answer in context matching question [OK]
Common Mistakes:
  • Printing the whole context instead of answer
  • Confusing object with location
  • Assuming no answer found
4. This code snippet tries to answer a question but raises an error:
from transformers import pipeline
qa = pipeline('question-answering')
context = "Python is a programming language."
question = "What is Python?"
result = qa(question, context)
print(result['answer'])
What is the error and how to fix it?
medium
A. Error: question is invalid; fix by changing question text
B. Error: missing keyword arguments; fix by using qa(question=question, context=context)
C. Error: context is empty; fix by adding text to context
D. No error; code runs fine

Solution

  1. Step 1: Identify the function call error

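    The pipeline is documented to take the keyword arguments question= and context=, but the code passes positional arguments. How positional strings are handled has varied between transformers versions, so the keyword form is the reliable one.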
  2. Step 2: Fix the call with correct keywords

    Change to qa(question=question, context=context) to fix the error.
  3. Final Answer:

    Error: missing keyword arguments; fix by using qa(question=question, context=context) -> Option B
  4. Quick Check:

    Use keywords for qa() args [OK]
Hint: Use keyword arguments for question and context [OK]
Common Mistakes:
  • Passing positional args instead of keywords
  • Assuming empty context causes error
  • Changing question text unnecessarily
5. You want to build a question answering system that can handle multiple paragraphs and find the best answer. Which approach is best?
hard
A. Split text into paragraphs, run QA on each, then pick highest confidence answer
B. Combine all paragraphs into one string and run QA once
C. Only use the first paragraph for QA
D. Ignore paragraphs and guess answer randomly

Solution

  1. Step 1: Understand handling multiple paragraphs

    QA models usually work best on smaller text chunks, so splitting helps.
  2. Step 2: Choose method to find best answer

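    Running QA on each paragraph separately and selecting the answer with the highest confidence score lets the model work on one small chunk at a time, which usually gives better results than the other options.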
  3. Final Answer:

    Split text into paragraphs, run QA on each, then pick highest confidence answer -> Option A
  4. Quick Check:

    Split + score answers = best result [OK]
Hint: Split text, run QA per part, pick best answer [OK]
Common Mistakes:
  • Running QA on all the text at once, which can confuse the model
  • Ignoring paragraph boundaries, which reduces accuracy
  • Guessing answers without context
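
The approach from question 5 can be sketched as follows. The function best_answer and the stand-in scorer toy_qa are hypothetical; the only assumption about the real model is that it is called with question= and context= and returns a dictionary with 'answer' and 'score' keys, as the transformers question-answering pipeline does. Paragraphs are assumed to be separated by blank lines.

```python
def best_answer(question, text, qa_fn):
    """Run qa_fn on each paragraph and return the highest-scoring result."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return None
    candidates = [qa_fn(question=question, context=p) for p in paragraphs]
    return max(candidates, key=lambda c: c["score"])


def toy_qa(question, context):
    """Stand-in for a real QA model, used only to make this sketch runnable.

    It scores a paragraph by the share of question words it contains
    and returns the whole paragraph as the "answer".
    """
    q_words = set(question.lower().rstrip("?").split())
    c_words = set(context.lower().rstrip(".").split())
    return {"answer": context, "score": len(q_words & c_words) / len(q_words)}


text = "Python is a programming language.\n\nThe Eiffel Tower is in Paris."
best = best_answer("Where is the Eiffel Tower?", text, toy_qa)
print(best["answer"], best["score"])
```

To use a real model, pass the pipeline object in place of toy_qa. Keep in mind that confidence scores from separate runs are not always directly comparable, so the highest score is a useful heuristic rather than a guarantee.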