For question answering (QA) systems that extract answers, Exact Match (EM) and F1 score are the key metrics. EM checks whether the predicted answer is identical to the true answer, so it rewards only fully correct predictions. F1 score balances precision and recall at the word level: precision is the share of predicted words that appear in the true answer, and recall is the share of true-answer words that the prediction recovers. These metrics matter because QA systems must find the right answer text precisely and completely from a passage.
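The two metrics can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names (`normalize`, `exact_match`, `token_f1`) are hypothetical, and the normalization step (lowercasing, dropping punctuation and English articles) is an assumption borrowed from the convention used by SQuAD-style evaluation.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction, truth):
    """Return 1 if the normalized token sequences are identical, else 0."""
    return int(normalize(prediction) == normalize(truth))


def token_f1(prediction, truth):
    """Harmonic mean of word-level precision and recall."""
    pred_tokens = normalize(prediction)
    true_tokens = normalize(truth)
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(pred_tokens == true_tokens)
    # Multiset intersection counts each shared word at most as often
    # as it occurs in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower"))      # 1
print(round(token_f1("in Paris, France", "Paris"), 2))      # 0.5
```

In the second call the prediction contains the true answer plus two extra words, so recall is 1 but precision is 1/3, which gives partial credit where EM would give none.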
Why QA systems extract answers in NLP - Why Metrics Matter
                Predicted Answer
    +-----------------+------------------+
    | Exact Match     | No Exact Match   |
+---+-----------------+------------------+
| T | True Positive   | False Negative   |
| r | (correctly      | (missed the      |
| u | extracted)      | correct answer)  |
| e +-----------------+------------------+
| A | False Positive  | True Negative    |
| n | (wrong answer)  | (correctly no    |
| s |                 | answer given)    |
+---+-----------------+------------------+
Note: In QA extraction, True Negative is less common because the task is to find an answer span.
Precision means how many extracted answers are actually correct. High precision means the system rarely gives wrong answers.
Recall means how many correct answers the system finds out of all possible correct answers. High recall means the system rarely misses the right answer.
Example: If a QA system extracts answers only when very sure, it has high precision but might miss some answers (low recall). If it extracts many answers, it finds more correct ones (high recall) but may include wrong ones (low precision).
Balancing precision and recall is important depending on use case. For example, a medical QA system should have high recall to avoid missing critical answers, while a chatbot might prefer high precision to avoid confusing users.
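The trade-off described above can be illustrated with a small sketch. The data is made up for illustration: each pair is a hypothetical model confidence and whether the extracted answer was correct, and every question is assumed to have an answer in its passage. The function name `precision_recall` and the thresholds are likewise assumptions.

```python
# Hypothetical predictions: (confidence, is_correct) for six questions
# that all have an answer in the passage.
predictions = [
    (0.95, True),
    (0.90, True),
    (0.80, False),
    (0.60, True),
    (0.40, False),
    (0.30, True),
]


def precision_recall(preds, threshold):
    """Answer only when confidence >= threshold, then score the result."""
    answered = [correct for conf, correct in preds if conf >= threshold]
    true_positives = sum(answered)
    precision = true_positives / len(answered) if answered else 0.0
    # Every question is answerable, so recall is measured over all questions.
    recall = true_positives / len(preds)
    return precision, recall


for threshold in (0.85, 0.25):
    p, r = precision_recall(predictions, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.85: precision=1.00, recall=0.33
# threshold=0.25: precision=0.67, recall=0.67
```

A strict threshold answers only two questions, both correctly, while a loose threshold answers all six and picks up two wrong answers along with two more correct ones.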
Good QA system: As a rough guide, EM and F1 scores above 80% suggest the system extracts answers accurately and completely.
Bad QA system: EM below 50% and F1 below 60% mean many answers are wrong or incomplete, making the system unreliable.
Also, if precision is very high but recall is very low, the system misses many answers. If recall is high but precision is low, many answers are wrong. Both cases reduce usefulness.
- Exact Match too strict: Small differences such as punctuation or synonyms can lower EM even when the answer is good.
- Ignoring context: The extracted words may look correct but carry the wrong meaning if the context is missed.
- Data leakage: Training on test questions can inflate metrics falsely.
- Overfitting: High training scores but low test scores mean the model memorizes answers instead of understanding.
- Ignoring partial credit: F1 helps but still may not capture answer usefulness fully.
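The data-leakage pitfall in the list above is one of the easier ones to check for mechanically. The sketch below flags test questions that also occur in the training set; the function names and the sample questions are hypothetical, and the matching is deliberately simple (case and whitespace only), so paraphrased duplicates would not be caught.

```python
def normalize_question(question):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(question.lower().split())


def find_leaked_questions(train_questions, test_questions):
    """Return the test questions that also appear in the training set."""
    seen = {normalize_question(q) for q in train_questions}
    return [q for q in test_questions if normalize_question(q) in seen]


train = ["What color is the sky?", "Who wrote Hamlet?"]
test = ["what color is the  sky?", "Where is the Eiffel Tower?"]
print(find_leaked_questions(train, test))  # ['what color is the  sky?']
```

Any question this returns should be removed from one of the two sets before EM and F1 are reported.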
Your QA model has 85% Exact Match but only 40% recall on answers. Is it good?
Answer: No. When the model does return an answer it is usually correct, but it misses many answers overall. This low recall means many questions remain unanswered, which can frustrate users. The model needs better recall while keeping precision high.
Practice
Solution
Step 1: Understand the purpose of QA systems
QA systems are designed to find specific answers in a given text so users get help quickly.
Step 2: Compare options with QA system goals
Only "To provide quick and exact information to users" matches this goal; the other options describe unrelated tasks.
Final Answer:
To provide quick and exact information to users -> Option A
Quick Check:
QA systems extract answers = quick, exact info [OK]
- Confusing QA with translation or summarization
- Thinking QA generates random text
- Assuming QA only summarizes documents
Solution
Step 1: Recall how QA systems work
QA systems need both a question and a context (text) to find the correct answer.
Step 2: Evaluate each option
Only "Provide a question and context text, then call the QA model to extract the answer" correctly describes supplying both a question and a context; the other options miss key inputs or are irrelevant.
Final Answer:
Provide a question and context text, then call the QA model to extract the answer -> Option B
Quick Check:
QA usage = question + context [OK]
- Trying to get answers without context
- Providing unrelated documents without a question
- Using random inputs instead of text
# qa_model is assumed to be an extractive QA callable defined elsewhere
question = "What color is the sky?"
context = "The sky is blue during the day and black at night."
answer = qa_model(question=question, context=context)
print(answer)
What is the expected output?
Solution
Step 1: Understand the question and context
The question asks for the sky's color, and the context says "The sky is blue during the day and black at night."
Step 2: Identify the correct answer from context
The model should extract "blue" as the color of the sky (the direct answer to the question).
Final Answer:
"blue" -> Option C
Quick Check:
Sky color = blue [OK]
- Choosing 'black' because it appears in context
- Confusing time of day with color
- Picking unrelated words from context
Solution
Step 1: Analyze why QA systems return empty answers
If the question does not match the context, the system cannot find an answer and returns an empty result.
Step 2: Evaluate options for likely cause
Only "The question is not related to the provided context" correctly identifies the mismatch as the cause; the other options are incorrect or unrealistic.
Final Answer:
The question is not related to the provided context -> Option D
Quick Check:
Unrelated question = empty answer [OK]
- Assuming model always fails
- Ignoring question-context relevance
- Thinking empty answer means error
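One way a system could arrive at an empty answer for an unrelated question is to reject candidate spans whose score is too low. The sketch below is purely illustrative: `select_answer`, the `(text, score)` candidate format, and the 0.5 cutoff are all assumptions, not the behavior of any particular library or model.

```python
def select_answer(candidates, min_score=0.5):
    """Return the best-scoring span, or "" when nothing is confident enough.

    candidates: list of (answer_text, score) pairs. Both this structure and
    the 0.5 default are illustrative assumptions, not a library API.
    """
    if not candidates:
        return ""
    text, score = max(candidates, key=lambda pair: pair[1])
    return text if score >= min_score else ""


related = [("blue", 0.92), ("black", 0.31)]    # question matches the context
unrelated = [("night", 0.04), ("day", 0.03)]   # question does not match
print(repr(select_answer(related)))    # 'blue'
print(repr(select_answer(unrelated)))  # ''
```

Seen this way, an empty answer is a deliberate "no answer found" result and not a failure of the system.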
Solution
Step 1: Understand customer needs in support
Customers usually want quick, exact answers to their questions, not long summaries.
Step 2: Compare answer extraction vs summarization
Extracting exact answers targets specific questions, while summaries provide general information that may be less helpful.
Final Answer:
Because customers want quick, precise answers, not long summaries -> Option A
Quick Check:
Customer support needs precise answers [OK]
- Thinking summaries are always error-prone
- Assuming summaries can't be automated
- Confusing speed with accuracy
