For question answering (QA) systems that extract answers, Exact Match (EM) and F1 score are the key metrics. EM checks whether the predicted answer string is identical to the true answer, showing how precise the system is. F1 balances precision and recall by measuring how many words (tokens) the predicted and true answers share. These metrics matter because a QA system must find the right answer text both precisely and completely within a passage.
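As a concrete sketch, EM is usually computed after light normalization (lowercasing, stripping punctuation and articles), as in SQuAD-style evaluation. The function names here (`normalize`, `exact_match`) are illustrative, not from any particular library:

```python
import re
import string

def normalize(text):
    """SQuAD-style normalization: lowercase, drop punctuation,
    remove articles (a/an/the), and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """Return 1 if the normalized answers are identical, else 0."""
    return int(normalize(prediction) == normalize(gold))
```

With this normalization, `exact_match("The Eiffel Tower", "eiffel tower.")` scores 1 even though the raw strings differ.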
Why Metrics Matter in Extractive QA
                       Predicted Answer
              +----------------+----------------+
              | Exact Match    | No Exact Match |
+-------------+----------------+----------------+
| True answer | True Positive  | False Negative |
| present     | (correctly     | (missed the    |
|             | extracted)     | correct answer)|
+-------------+----------------+----------------+
| True answer | False Positive | True Negative  |
| absent      | (wrong answer) | (correctly no  |
|             |                | answer given)  |
+-------------+----------------+----------------+
Note: In QA extraction, True Negative is less common because the task is to find an answer span.
Precision means how much of what the system extracts is actually correct; at the token level, it is the share of predicted answer words that also appear in the true answer. High precision means the system rarely outputs wrong answers.
Recall means how much of the correct answer the system actually finds; at the token level, it is the share of true answer words that appear in the prediction. High recall means the system rarely misses the right answer.
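These token-level definitions can be sketched in a few lines. The name `token_f1` is illustrative, and simple whitespace tokenization is assumed:

```python
from collections import Counter

def token_f1(prediction, gold):
    """Token-overlap precision, recall, and F1 between a predicted
    answer span and the gold answer span."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most as
    # often as it appears in both answers.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, `token_f1("the eiffel tower", "eiffel tower in paris")` shares 2 tokens, giving precision 2/3, recall 1/2, and F1 4/7.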
Example: If a QA system extracts an answer only when it is very confident, it has high precision but may miss some answers (low recall). If it answers liberally, it finds more correct answers (high recall) but includes more wrong ones (low precision).
Balancing precision and recall is important depending on use case. For example, a medical QA system should have high recall to avoid missing critical answers, while a chatbot might prefer high precision to avoid confusing users.
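This trade-off is often tuned with a confidence threshold: raising it favors precision (fewer, surer answers), lowering it favors recall. The function `answer_with_threshold` and its inputs are hypothetical, assuming the model produces (answer, confidence) candidates:

```python
def answer_with_threshold(scored_candidates, threshold):
    """Return the highest-confidence candidate answer only if its
    score clears `threshold`; otherwise abstain (return None).
    A higher threshold trades recall for precision."""
    best_answer, best_score = max(scored_candidates, key=lambda c: c[1])
    return best_answer if best_score >= threshold else None
```

A medical QA deployment might set a low threshold to avoid missing answers, while a consumer chatbot might set a high one to avoid confident-sounding mistakes.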
Good QA system: EM and F1 scores above 80% show the system extracts answers accurately and completely.
Bad QA system: EM below 50% and F1 below 60% means many answers are wrong or incomplete, making the system unreliable.
Also, if precision is very high but recall is very low, the system misses many answers. If recall is high but precision is low, many answers are wrong. Both cases reduce usefulness.
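At the dataset level, benchmarks such as SQuAD average EM and F1 over all questions, taking the best score across multiple acceptable gold answers per question. A minimal sketch with illustrative names (`evaluate`, `em_score`, `f1_score`) and simplified whitespace-token scoring:

```python
def em_score(pred, gold):
    """1.0 if the lowercased token sequences match exactly, else 0.0."""
    return float(pred.lower().split() == gold.lower().split())

def f1_score(pred, gold):
    """Token-overlap F1 between prediction and one gold answer."""
    pred_t, gold_t = pred.lower().split(), gold.lower().split()
    overlap = sum(min(pred_t.count(t), gold_t.count(t)) for t in set(pred_t))
    if overlap == 0:
        return 0.0
    p, r = overlap / len(pred_t), overlap / len(gold_t)
    return 2 * p * r / (p + r)

def evaluate(predictions, references):
    """predictions: {qid: answer}; references: {qid: [gold answers]}.
    Returns corpus-average EM and F1 as percentages, taking the
    best score over the acceptable gold answers for each question."""
    em = f1 = 0.0
    for qid, golds in references.items():
        pred = predictions.get(qid, "")
        em += max(em_score(pred, g) for g in golds)
        f1 += max(f1_score(pred, g) for g in golds)
    n = len(references)
    return 100 * em / n, 100 * f1 / n
```

Reporting both numbers together shows whether a system is precise when it answers (EM) and how complete its answers are on average (F1).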
- Exact Match too strict: Small differences such as punctuation, articles, or synonyms can make EM low even when the answer is good.
- Ignoring context: Extracted answer might be correct words but wrong meaning if context is missed.
- Data leakage: Training on test questions can inflate metrics falsely.
- Overfitting: High training scores but low test scores mean the model memorizes answers instead of understanding.
- Ignoring partial credit: F1 helps but still may not capture answer usefulness fully.
Your QA model has 85% Exact Match but only 40% recall on answers. Is it good?
Answer: No. When the model does produce an answer it is usually correct, but 40% recall means it misses most answers overall, leaving many questions effectively unanswered and frustrating users. The goal should be to raise recall while keeping precision high.