Answer span extraction in NLP - Model Metrics & Evaluation

Answer span extraction means finding the exact portion of a text that answers a question. The primary metric is Exact Match (EM), which measures how often the model's predicted span is identical to the true answer. The other key metric is the F1 score, which measures how much the predicted answer overlaps with the true answer at the token level. These metrics matter because, in practice, returning the exact answer or a very close one is what users care about.
For answer span extraction, we don't use a classic confusion matrix like in classification. Instead, we compare predicted spans to true spans:
True answer span: "the quick brown fox"
Predicted span: "quick brown"
Overlap tokens: 2
Total tokens in true answer: 4
Total tokens in predicted answer: 2
Precision = Overlap / Predicted tokens = 2/2 = 1.0
Recall = Overlap / True tokens = 2/4 = 0.5
F1 = 2 * (Precision * Recall) / (Precision + Recall) = 2 * (1.0 * 0.5) / (1.0 + 0.5) ≈ 0.67
Exact Match = 0 (because the spans are not identical)
This shows how F1 captures partial correctness, while Exact Match is strict.
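The calculation above can be sketched in Python. This is a minimal version using simple whitespace tokenization; real evaluation scripts (such as the SQuAD script) also normalize case and punctuation first.

```python
from collections import Counter

def token_f1(prediction: str, truth: str) -> float:
    """Token-level F1 between a predicted span and the true span."""
    pred_tokens = prediction.split()
    true_tokens = truth.split()
    # Count overlapping tokens, respecting duplicates.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction: str, truth: str) -> int:
    """1 if the spans are identical, else 0."""
    return int(prediction == truth)

print(round(token_f1("quick brown", "the quick brown fox"), 2))  # 0.67
print(exact_match("quick brown", "the quick brown fox"))         # 0
```

Running this on the example reproduces the numbers worked out by hand: F1 of 0.67 despite an Exact Match of 0.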
In answer span extraction, precision means how much of the predicted answer is correct, and recall means how much of the true answer the model found.
High precision, low recall: The model gives short answers that are always correct but miss some parts. For example, predicting "brown fox" when the true answer is "the quick brown fox". This is safe but incomplete.
High recall, low precision: The model gives long answers that include the true answer but also extra words. For example, predicting "the quick brown fox jumps" when the true answer is "quick brown fox". This covers the answer but adds noise.
Good models balance precision and recall to get a high F1 score, meaning answers are mostly correct and mostly complete.
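The two failure modes above can be checked numerically. A minimal sketch, again assuming whitespace tokenization:

```python
from collections import Counter

def precision_recall(prediction: str, truth: str) -> tuple[float, float]:
    """Token-level precision and recall of a predicted span."""
    pred, true = prediction.split(), truth.split()
    overlap = sum((Counter(pred) & Counter(true)).values())
    return overlap / len(pred), overlap / len(true)

# High precision, low recall: short but fully correct prediction.
print(precision_recall("brown fox", "the quick brown fox"))  # (1.0, 0.5)

# High recall, low precision: prediction covers the answer plus extra words.
print(precision_recall("the quick brown fox jumps", "quick brown fox"))  # (0.6, 1.0)
```

The first prediction contains only correct tokens but misses half the answer; the second recovers the whole answer but dilutes it with noise. F1 penalizes both.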
Good: Exact Match above 70% and F1 score above 80% usually mean the model finds answers correctly and mostly exactly. This is great for applications like chatbots or search engines.
Bad: Exact Match below 40% and F1 below 50% show the model struggles to find correct answers or only finds partial or wrong spans. This leads to poor user experience.
- Ignoring partial matches: Only using Exact Match misses cases where the answer is mostly right but not exact.
- Overfitting: High Exact Match on training data but low on new data means the model memorizes answers instead of understanding.
- Data leakage: If test questions appear in training, metrics look better but don't reflect real performance.
- Ignoring answer length: Very short or very long predicted spans can skew precision or recall.
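Evaluation scripts typically guard against some of these pitfalls by normalizing spans before comparison and averaging metrics over the whole test set. The sketch below follows the spirit of the SQuAD evaluation script; the specific normalization steps (lowercasing, stripping punctuation and articles) are a common convention, not a fixed standard.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase; drop punctuation, articles, and extra whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def f1(prediction: str, truth: str) -> float:
    """Token-level F1 on normalized spans."""
    pred, true = normalize(prediction).split(), normalize(truth).split()
    overlap = sum((Counter(pred) & Counter(true)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(pred), overlap / len(true)
    return 2 * p * r / (p + r)

def evaluate(predictions: list[str], truths: list[str]) -> dict:
    """Average EM and F1 (in percent) over a test set."""
    n = len(truths)
    em = sum(normalize(p) == normalize(t) for p, t in zip(predictions, truths))
    total_f1 = sum(f1(p, t) for p, t in zip(predictions, truths))
    return {"exact_match": 100 * em / n, "f1": 100 * total_f1 / n}

preds = ["quick brown fox", "Paris", "in 1969"]
golds = ["the quick brown fox", "Paris, France", "1969"]
print(evaluate(preds, golds))
```

Note how normalization makes "quick brown fox" an exact match for "the quick brown fox": without it, a missing article would unfairly count as a complete miss.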
Your answer span extraction model has 60% Exact Match but 85% F1 score. Is it good? Why or why not?
Answer: It depends on the application. An F1 of 85% with an EM of only 60% means the model usually finds the right region of text but often gets the span boundaries slightly wrong, adding or dropping a few words at the edges. Note that F1 can never be lower than Exact Match on the same data, since every exact match also scores a perfect F1. For applications where near-misses are acceptable, this model may be fine; where exact answers matter, improving boundary prediction to raise EM toward F1 would help.