Answer span extraction means finding the exact part of a text that answers a question. The main metric is Exact Match (EM): the fraction of questions for which the predicted span is identical to the true answer. The other key metric is the F1 score, which measures how much the predicted answer overlaps with the true answer at the token level. Together they capture what matters in practice: getting the exact answer, or a very close one.
Answer span extraction in NLP - Model Metrics & Evaluation
For answer span extraction, we don't use a classic confusion matrix like in classification. Instead, we compare predicted spans to true spans:
True answer span: "the quick brown fox"
Predicted span: "quick brown"
Overlap tokens: 2
Total tokens in true answer: 4
Total tokens in predicted answer: 2
F1 = 2 * (Precision * Recall) / (Precision + Recall)
Precision = Overlap / Predicted tokens = 2/2 = 1.0
Recall = Overlap / True tokens = 2/4 = 0.5
F1 = 2 * (1.0 * 0.5) / (1.0 + 0.5) ≈ 0.67
Exact Match = 0 (because spans are not exactly the same)
This shows how F1 captures partial correctness, while Exact Match is strict.
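The arithmetic above can be reproduced in a few lines. This is a minimal sketch, assuming tokens are obtained by splitting on whitespace; the function name `span_scores` is made up for illustration, and real evaluation scripts usually normalize the text first (for example by lowercasing and stripping punctuation), which is left out here.

```python
from collections import Counter


def span_scores(predicted: str, truth: str) -> dict:
    """Token-level precision, recall, F1 and exact match for one prediction."""
    pred_tokens = predicted.split()
    true_tokens = truth.split()
    exact_match = int(pred_tokens == true_tokens)

    # If either span is empty, score 1.0 only when both are empty.
    if not pred_tokens or not true_tokens:
        score = float(pred_tokens == true_tokens)
        return {"precision": score, "recall": score, "f1": score,
                "exact_match": exact_match}

    # Multiset intersection: a shared token counts only as often as it
    # appears in both spans.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0,
                "exact_match": exact_match}

    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1,
            "exact_match": exact_match}


scores = span_scores("quick brown", "the quick brown fox")
print(scores["precision"], scores["recall"], round(scores["f1"], 2),
      scores["exact_match"])
# prints: 1.0 0.5 0.67 0
```

Averaging `f1` and `exact_match` over every question in a test set gives the two headline numbers reported for a model.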
In answer span extraction, precision means how much of the predicted answer is correct, and recall means how much of the true answer the model found.
High precision, low recall: The model gives short answers that are always correct but miss some parts. For example, predicting "brown fox" when the true answer is "the quick brown fox". This is safe but incomplete.
High recall, low precision: The model gives long answers that include the true answer but also extra words. For example, predicting "the quick brown fox jumps" when the true answer is "quick brown fox". This covers the answer but adds noise.
Good models balance precision and recall to get a high F1 score, meaning answers are mostly correct and mostly complete.
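The two trade-off cases can be checked numerically. The sketch below assumes whitespace tokenization and non-empty spans; `precision_recall` is a hypothetical helper name, not part of any library.

```python
from collections import Counter


def precision_recall(predicted: str, truth: str) -> tuple[float, float]:
    """Token-level precision and recall (assumes both spans are non-empty)."""
    pred_tokens, true_tokens = predicted.split(), truth.split()
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    return overlap / len(pred_tokens), overlap / len(true_tokens)


# Short prediction: everything predicted is correct, but half the answer is missing.
short_p, short_r = precision_recall("brown fox", "the quick brown fox")

# Long prediction: the whole answer is covered, plus two extra words.
long_p, long_r = precision_recall("the quick brown fox jumps", "quick brown fox")

print(short_p, short_r)  # prints: 1.0 0.5
print(long_p, long_r)    # prints: 0.6 1.0
```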
Good: As a rough guide, Exact Match above 70% and F1 above 80% suggest the model usually finds the right answer, often word for word. That is typically enough for applications like chatbots or search engines, although the numbers to expect vary by dataset.
Bad: Exact Match below 40% and F1 below 50% show the model struggles: it often returns partial or wrong spans, which leads to a poor user experience.
- Ignoring partial matches: Only using Exact Match misses cases where the answer is mostly right but not exact.
- Overfitting: High Exact Match on training data but low on new data means the model memorizes answers instead of understanding.
- Data leakage: If test questions appear in training, metrics look better but don't reflect real performance.
- Ignoring answer length: Very short or very long predicted spans can skew precision or recall.
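The data leakage pitfall is one that can be checked mechanically. A minimal sketch, assuming questions are plain strings and that a test question counts as leaked when it matches a training question after lowercasing and whitespace cleanup; the name `leaked_questions` and the sample questions are made up for illustration.

```python
def leaked_questions(train_questions, test_questions):
    """Return the test questions that also appear in the training set."""
    def normalize(question):
        # Lowercase and collapse whitespace so trivial variants still match.
        return " ".join(question.lower().split())

    train_seen = {normalize(q) for q in train_questions}
    return [q for q in test_questions if normalize(q) in train_seen]


# Made-up sample data for illustration only.
train = ["Who wrote Hamlet?", "What is the capital of France?"]
test = ["what is the capital of  France?", "When did WW2 end?"]

leaks = leaked_questions(train, test)
print(leaks)  # prints: ['what is the capital of  France?']
```

This only catches near-verbatim duplicates. Paraphrased questions would need a fuzzier comparison.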
Your answer span extraction model has 85% Exact Match but only 60% F1 score. Is it good? Why or why not?
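Answer: These numbers should make you suspicious of the evaluation rather than the model. With standard token-level scoring, every exact match also scores an F1 of 1.0, so the average F1 can never be lower than the average Exact Match. An F1 of 60% alongside 85% Exact Match points to a bug, such as the two metrics being computed on different examples or with different text normalization. Fix the evaluation before drawing conclusions about the model.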
Practice
answer span extraction in NLP?
Solution
Step 1: Understand the purpose of answer span extraction
Answer span extraction focuses on locating the exact segment in a text that directly answers a question.
Step 2: Compare with other NLP tasks
Unlike translation, summarization, or text generation, answer span extraction pinpoints a specific text span as the answer.
Final Answer:
To find the exact part of text that answers a question -> Option B
Quick Check:
Answer span extraction = find exact answer span [OK]
- Confusing answer span extraction with translation
- Thinking it summarizes text instead of extracting spans
- Assuming it generates new text
Solution
Step 1: Identify typical data types for positions
Positions in text are usually represented by integer indices marking start and end locations.
Step 2: Evaluate options
Strings or booleans do not represent positions well; floats for time are unrelated to text spans.
Final Answer:
start_index and end_index as integers -> Option A
Quick Check:
Positions = integer indices [OK]
- Using strings instead of integer indices
- Confusing character positions with time values
- Using booleans for position markers
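A span stored as two integers might look like the sketch below. It assumes character offsets with an exclusive end, as in Python slicing; the class name `AnswerSpan` and its `text` method are hypothetical, chosen only to mirror the `start_index` and `end_index` names in the answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnswerSpan:
    start_index: int  # character offset of the first character of the answer
    end_index: int    # character offset one past the last character (assumption)

    def text(self, context: str) -> str:
        """Recover the answer string from the context it was found in."""
        return context[self.start_index:self.end_index]


context = "The cat sat on the mat."
span = AnswerSpan(start_index=4, end_index=7)
print(span.text(context))  # prints: cat
```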
'The cat sat on the mat.' and predicted start index = 1, end index = 4, what is the extracted answer span?
Solution
Step 1: Identify tokens and their indices
Splitting the sentence on whitespace gives ['The'(0), 'cat'(1), 'sat'(2), 'on'(3), 'the'(4), 'mat.'(5)]. The given indices (1 and 4) are 0-based token positions.
Step 2: Extract tokens from start to end index
This question treats the end index as exclusive, as in Python slicing: tokens[1:4] = ['cat'(1), 'sat'(2), 'on'(3)] = 'cat sat on'. Many question answering models instead treat the end index as inclusive, which would give 'cat sat on the', so check which convention your model uses.
Final Answer:
'cat sat on' -> Option A
Quick Check:
tokens[1:4] = 'cat sat on' [OK]
- Confusing character indices with token indices
- Off-by-one errors in slicing
- Ignoring punctuation in tokens
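The extraction step, and the off-by-one risk listed above, can be seen directly in code. This sketch assumes whitespace tokenization and shows both end-index conventions side by side; the variable names are made up for illustration.

```python
tokens = "The cat sat on the mat.".split()
start, end = 1, 4

# End index exclusive, as in the worked solution (plain Python slicing).
exclusive_span = " ".join(tokens[start:end])

# End index inclusive: one more token is kept.
inclusive_span = " ".join(tokens[start:end + 1])

print(exclusive_span)  # prints: cat sat on
print(inclusive_span)  # prints: cat sat on the
```

The two results differ by exactly one token, which is why it helps to confirm the convention a model or dataset uses before scoring its output.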
Solution
Step 1: Understand the problem with indices
End index smaller than start index is invalid because answer spans must go forward in text.
Step 2: Choose a fix that preserves valid spans
Swapping start and end indices corrects the order and keeps the predicted span meaningful.
Final Answer:
Swap the start and end indices if end < start -> Option C
Quick Check:
Fix invalid spans by swapping indices [OK]
- Ignoring invalid spans instead of fixing
- Forcing fixed span length blindly
- Using only one index loses answer context
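The swap described in this solution is a one-line check. A minimal sketch, with `fix_span` as a hypothetical helper name; note that swapping is the fix this question chooses, while a decoder can also avoid the problem entirely by only scoring pairs where start ≤ end.

```python
def fix_span(start: int, end: int) -> tuple[int, int]:
    """Return the indices in forward order, swapping them if end < start."""
    if end < start:
        return end, start
    return start, end


print(fix_span(5, 2))  # prints: (2, 5)
print(fix_span(1, 4))  # prints: (1, 4)
```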
Solution
Step 1: Understand logits for start and end tokens
Start and end logits represent scores for each token being the start or end of the answer span.
Step 2: Combine logits to find best span
We look for the pair (start, end) with the highest combined score, ensuring start ≤ end to form a valid span.
Final Answer:
Find the pair of start and end indices with the highest sum of start and end logits where start ≤ end -> Option D
Quick Check:
Combine start and end logits to find best span [OK]
- Ignoring end logits and using start only
- Choosing invalid spans where end < start
- Picking random indices without scores
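The span search from this solution can be sketched as a brute-force loop over valid pairs. It assumes the logits arrive as plain lists of floats, one per token, and that the end index is inclusive (so start = end is a one-token answer); `best_span` and the sample logits are made up for illustration.

```python
def best_span(start_logits, end_logits):
    """Return (start, end) with the highest summed logit, subject to start <= end."""
    best_score = float("-inf")
    best_pair = (0, 0)
    for start, start_score in enumerate(start_logits):
        # Only consider end positions at or after the start position.
        for end in range(start, len(end_logits)):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best_pair = (start, end)
    return best_pair


# Made-up logits: the highest start score is at index 2 and the highest end
# score is at index 1, which would be an invalid span if picked independently.
start_logits = [0.1, 0.2, 3.0, 0.5]
end_logits = [0.3, 2.5, 0.4, 1.0]

print(best_span(start_logits, end_logits))  # prints: (2, 3)
```

Picking the best start and the best end independently would give (2, 1) here. Searching over pairs returns the best valid span instead. The loop is quadratic in the number of tokens, which is acceptable for a sketch but is often bounded in practice by limiting the answer length.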
