In open-domain question answering (QA), the main goal is to find the correct answer from a large collection of documents. The key metrics are Exact Match (EM) and F1 score. Exact Match checks whether the predicted answer string matches the true answer exactly, usually after light normalization such as lowercasing and removing punctuation and articles. The F1 score measures how much the predicted answer overlaps with the true answer at the token level. These metrics matter because answers can be short phrases or sentences, so partial correctness is important to capture.
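The two metrics above can be sketched in a few lines. This is a minimal, SQuAD-style implementation (the exact normalization rules vary between benchmarks):

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, truth):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(pred) == normalize(truth))

def token_f1(pred, truth):
    """Harmonic mean of token-level precision and recall."""
    pred_toks = normalize(pred).split()
    truth_toks = normalize(truth).split()
    common = Counter(pred_toks) & Counter(truth_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(truth_toks)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("Paris, the capital of France", "Paris")` gives partial credit (0.4), while `exact_match` on the same pair gives 0.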
Open-domain QA basics in NLP - Model Metrics & Evaluation
Which metric matters for Open-domain QA and WHY
Confusion matrix or equivalent visualization
Open-domain QA does not use a classic confusion matrix the way classification tasks do. Instead, we evaluate predictions per question:
- Question 1: Predicted answer = "Paris"; True answer = "Paris" → Exact Match = 1, F1 = 1.0
- Question 2: Predicted answer = "Paris, the capital of France"; True answer = "Paris" → Exact Match = 0, F1 > 0 (partial overlap)
- Question 3: Predicted answer = "London"; True answer = "Paris" → Exact Match = 0, F1 = 0
We then average these scores over all questions to get overall performance.
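The per-question scoring and averaging can be sketched as follows, using the three example questions above (a simplified F1 on lowercased tokens with punctuation stripped; full benchmarks also normalize articles):

```python
import string
from collections import Counter

def f1(pred, truth):
    """Token-level F1 on lowercased, punctuation-stripped tokens."""
    p = [w.strip(string.punctuation) for w in pred.lower().split()]
    t = [w.strip(string.punctuation) for w in truth.lower().split()]
    overlap = sum((Counter(p) & Counter(t)).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(p), overlap / len(t)
    return 2 * prec * rec / (prec + rec)

examples = [
    ("Paris", "Paris"),                         # exact match
    ("Paris, the capital of France", "Paris"),  # partial overlap
    ("London", "Paris"),                        # no overlap
]

# Average per-question scores to get overall performance
em = sum(p.lower() == t.lower() for p, t in examples) / len(examples)
avg_f1 = sum(f1(p, t) for p, t in examples) / len(examples)
```

Here overall Exact Match is 1/3 and overall F1 is about 0.44: F1 rewards the partially overlapping second answer even though only the first is exactly right.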
Precision vs Recall tradeoff with concrete examples
In open-domain QA, F1 is built from token-level precision and recall: precision is the fraction of predicted tokens that appear in the true answer, and recall is the fraction of true-answer tokens that appear in the prediction. In other words, they measure how much of the true answer is captured and how much extra irrelevant text is included.
- High precision, low recall: The model gives a very short answer that is correct but misses some details. For example, answer "Paris" when the full answer is "Paris, the capital of France". This is precise but incomplete.
- High recall, low precision: The model gives a long answer that includes the correct answer but also extra unrelated words. For example, "Paris is the capital of France and a beautiful city". This covers the true answer but adds noise.
Balancing precision and recall is important to give answers that are both correct and concise.
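The two cases above can be checked numerically. This sketch computes token-level precision, recall, and F1 (lowercased tokens, punctuation stripped):

```python
import string
from collections import Counter

def prf(pred, truth):
    """Return token-level (precision, recall, F1)."""
    p = [w.strip(string.punctuation) for w in pred.lower().split()]
    t = [w.strip(string.punctuation) for w in truth.lower().split()]
    overlap = sum((Counter(p) & Counter(t)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    prec, rec = overlap / len(p), overlap / len(t)
    return prec, rec, 2 * prec * rec / (prec + rec)

# High precision, low recall: short answer, every token correct
short = prf("Paris", "Paris, the capital of France")

# High recall, low precision: covers the answer plus extra words
long = prf("Paris is the capital of France and a beautiful city", "Paris")
```

In the first case precision is 1.0 but recall is only 0.2; in the second, recall is 1.0 but precision is 0.1, so both answers are penalized by F1 relative to an exact span.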
What "good" vs "bad" metric values look like for Open-domain QA
- Good: Exact Match above roughly 70% and F1 above roughly 80% usually mean the model finds the correct answer most of the time and captures most answer tokens. Exact thresholds depend on the dataset and question difficulty.
- Bad: Exact Match below roughly 40% and F1 below roughly 50% mean the model often misses the correct answer or gives very incomplete answers.
Common pitfalls in Open-domain QA metrics
- Ignoring partial matches: Using only Exact Match can be too strict and miss partially correct answers.
- Data leakage: If the model sees test answers during training, metrics will be unrealistically high.
- Overfitting: Very high training scores but low test scores mean the model memorizes answers instead of understanding.
- Ambiguous answers: Some questions have multiple correct answers, so metrics must allow for synonyms or paraphrases.
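The last pitfall is commonly handled by scoring a prediction against every acceptable gold answer and keeping the best score (the convention used by SQuAD-style evaluation). A minimal sketch, reusing a simple token F1:

```python
import string
from collections import Counter

def f1(pred, truth):
    """Token-level F1 on lowercased, punctuation-stripped tokens."""
    p = [w.strip(string.punctuation) for w in pred.lower().split()]
    t = [w.strip(string.punctuation) for w in truth.lower().split()]
    overlap = sum((Counter(p) & Counter(t)).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(p), overlap / len(t)
    return 2 * prec * rec / (prec + rec)

def best_f1(pred, references):
    """Score against each acceptable gold answer and keep the maximum."""
    return max(f1(pred, ref) for ref in references)

# Hypothetical question with two acceptable phrasings of the gold answer
golds = ["NYC", "New York City"]
score = best_f1("New York City", golds)  # matches the second reference fully
```

Taking the maximum means a model is not penalized for choosing any one valid phrasing, though it still cannot credit paraphrases that share no tokens with any reference.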
Self-check question
Your open-domain QA model has 60% Exact Match but 85% F1 score. Is it good? Why or why not?
Answer: The model produces the exactly correct string only 60% of the time, but the high F1 shows that its answers usually overlap heavily with the true answer, often differing only in span boundaries or extra words. Overall the model is reasonably good; tightening answer spans would raise Exact Match.
Key Result
Exact Match and F1 score are key metrics; Exact Match measures full correctness, F1 captures partial correctness in open-domain QA.