
Extractive QA concept in NLP - Model Metrics & Evaluation

Which metric matters for Extractive QA and WHY

In Extractive Question Answering (QA), the goal is to extract the exact span of text that answers a question. The two standard metrics are Exact Match (EM) and F1 score.

Exact Match (EM) checks if the predicted answer exactly matches the true answer. It is strict and shows how often the model gets the answer perfectly right.

F1 score measures overlap between the predicted and true answer words. It balances precision (how many predicted words are correct) and recall (how many true answer words were found). This helps when answers are partially correct.

These metrics matter because Extractive QA needs precise text spans. EM shows perfect hits, while F1 shows partial correctness, giving a fuller picture.

Confusion matrix or equivalent visualization

Extractive QA does not use a classic confusion matrix the way classification does. Instead, we compare predicted answer spans with the true (gold) answer spans token by token. For example:

True Answer: "the Eiffel Tower"
Predicted Answer: "Eiffel Tower"

- Exact Match: No under strict string comparison (the missing "the" breaks the match)
- Precision: 2/2 = 1.0 (all predicted words are correct)
- Recall: 2/3 ≈ 0.67 (missed "the")
- F1: 2 * (1.0 * 0.67) / (1.0 + 0.67) ≈ 0.8
    

This shows how partial matches are scored.
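
The token-overlap scoring above can be sketched in a few lines of Python. This is a minimal illustration using whitespace tokenization; `token_f1` is a hypothetical helper name, not part of any standard library.

```python
from collections import Counter

def token_f1(prediction: str, truth: str) -> dict:
    """Token-overlap precision, recall, and F1 (simple whitespace tokenization)."""
    pred_tokens = prediction.lower().split()
    true_tokens = truth.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = token_f1("Eiffel Tower", "the Eiffel Tower")
print(scores)  # precision 1.0, recall ~0.67, f1 0.8
```

Running this on the Eiffel Tower example reproduces the numbers worked out above.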

Precision vs Recall tradeoff with examples

In Extractive QA:

  • High Precision: The model's answers are mostly correct words but might miss some parts. Good when you want very accurate answers.
  • High Recall: The model finds most of the true answer words but may include extra words. Good when you want to catch all relevant info, even if some noise is included.

Example: For a question about "Where is the Eiffel Tower?" (true answer: "in Paris")

  • High precision, low recall: "Paris" (correct words, but missing "in")
  • High recall, low precision: "in Paris, France" (includes extra words)

Balancing precision and recall with F1 helps find the best middle ground.
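
The two cases above can be checked numerically. A small sketch, assuming whitespace tokenization with punctuation stripped (`token_prf` is an illustrative helper, not a library function):

```python
import re
from collections import Counter

def token_prf(prediction: str, truth: str) -> tuple:
    """Return (precision, recall, F1) over lowercased, punctuation-free tokens."""
    norm = lambda s: re.sub(r"[^\w\s]", "", s.lower()).split()
    pred, true = norm(prediction), norm(truth)
    overlap = sum((Counter(pred) & Counter(true)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    p, r = overlap / len(pred), overlap / len(true)
    return p, r, 2 * p * r / (p + r)

# High precision, low recall: every predicted token is correct, but one is missed.
print(token_prf("Paris", "in Paris"))             # (1.0, 0.5, ~0.67)

# High recall, low precision: all true tokens found, plus an extra one.
print(token_prf("in Paris, France", "in Paris"))  # (~0.67, 1.0, 0.8)
```

Note that F1 rewards the over-long answer here (0.8 vs ~0.67) simply because it recovers more of the gold tokens; which failure mode is worse depends on the application.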

What "good" vs "bad" metric values look like for Extractive QA

Good values (rules of thumb; thresholds vary with the dataset and its difficulty):

  • Exact Match (EM) above 70% means the model often finds the exact answer.
  • F1 score above 80% means the model finds most answer words correctly, even if not exact.

Bad values:

  • EM below 40% means the model rarely gets exact answers.
  • F1 below 50% means the model often misses many answer words or adds wrong words.

Good scores mean users get reliable answers. Bad scores mean answers are often wrong or incomplete.
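
Dataset-level EM and F1 are averages of per-example scores. A minimal sketch of SQuAD-style aggregation, where each question may list several acceptable gold answers and the best score per example is kept (the helper names and the tiny dataset are illustrative):

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    # Strict match after lowercasing only, for brevity.
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred: str, gold: str) -> float:
    p, g = pred.lower().split(), gold.lower().split()
    ov = sum((Counter(p) & Counter(g)).values())
    return 0.0 if ov == 0 else 2 * ov / (len(p) + len(g))

# Each example: (prediction, list of acceptable gold answers).
examples = [
    ("Eiffel Tower", ["the Eiffel Tower", "Eiffel Tower"]),
    ("in Paris", ["Paris"]),
]

# Take the best score over the gold answers, then average over the dataset.
em = sum(max(exact_match(p, g) for g in golds) for p, golds in examples) / len(examples)
f1 = sum(max(token_f1(p, g) for g in golds) for p, golds in examples) / len(examples)
print(f"EM = {em:.2f}, F1 = {f1:.2f}")  # EM = 0.50, F1 = 0.83
```

Even this toy dataset shows the typical pattern: F1 sits well above EM because partial overlaps earn partial credit.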

Common pitfalls in Extractive QA metrics

  • Ignoring partial credit: Only using Exact Match misses partially correct answers, which F1 captures.
  • Answer variations: Different but correct answer forms (e.g., "U.S." vs "United States") can lower EM unfairly.
  • Data leakage: If test questions appear in training, metrics look better but model won't generalize.
  • Overfitting: Very high EM on training but low on test means model memorizes rather than understands.
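
The answer-variation pitfall is usually mitigated by normalizing both strings before comparison. A sketch of SQuAD-style normalization (lowercase, strip punctuation and the articles a/an/the, collapse whitespace); note it does not fix abbreviations like "U.S." vs "United States":

```python
import re
import string

def normalize_answer(s: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, truth: str) -> bool:
    return normalize_answer(prediction) == normalize_answer(truth)

print(exact_match("Eiffel Tower", "the Eiffel Tower"))  # True: article is stripped
print(exact_match("U.S.", "United States"))             # False: still distinct strings
```

With this normalization, the earlier "Eiffel Tower" example would count as an exact match, which is why the raw EM from naive string comparison can understate model quality.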

Self-check question

Your Extractive QA model has 50% Exact Match but 85% F1 score. Is it good?

Answer: Quite possibly. Per example, F1 is always at least as high as EM, since an exact match scores F1 = 1.0, so a gap between the two is expected. A 35-point gap suggests many predictions are partially correct: they overlap heavily with the gold answer but differ by a word or a span boundary. Overall quality is decent, but you should inspect answer normalization and span boundaries (e.g. stray articles) to close the gap.

Key Result
Exact Match counts perfect answer hits; F1 credits partial overlap. Both are needed to judge Extractive QA quality.