For question answering (QA) tasks using Hugging Face pipelines, the key metrics are Exact Match (EM) and F1 score. Exact Match measures how often the model's answer exactly matches the correct answer. F1 score measures the overlap between the predicted and true answer words, balancing precision and recall. These metrics matter because QA answers can be short phrases or sentences, so exact matches are strict, while F1 allows partial credit for close answers.
QA with Hugging Face pipeline in NLP - Model Metrics & Evaluation
QA tasks do not use a traditional confusion matrix because answers are text, not classes. Instead, evaluation compares predicted answers to true answers using token-level overlap.
True answer: "Paris"
Predicted answer: "Paris"
Exact Match: 1 (correct)
True answer: "Paris"
Predicted answer: "the city of Paris"
Exact Match: 0 (not exact)
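F1 score: partial credit from word overlap (1 of the 4 predicted words matches, so precision = 1/4, recall = 1, and F1 = 0.4 before any answer normalization)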
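The two comparisons above can be reproduced with a few lines of Python. This is a minimal sketch, not the scoring code of any particular library: the helper names normalize_answer and exact_match are illustrative, and the normalization steps (lowercasing, dropping punctuation and English articles) are an assumption modeled on common SQuAD-style evaluation.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: this mirrors common SQuAD-style normalization; other
    evaluation setups may normalize differently or not at all.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> int:
    """Return 1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(truth))


print(exact_match("Paris", "Paris"))              # 1
print(exact_match("the city of Paris", "Paris"))  # 0
```

With this normalization, a prediction such as "Paris." would still count as an exact match, while extra content words like "city of" would not.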
In QA, precision is the fraction of words in the predicted answer that also appear in the true answer, and recall is the fraction of words from the true answer that appear in the predicted answer. For example:
- Predicted: "Paris"
- True: "Paris"
Precision = 1, Recall = 1 (perfect)
- Predicted: "the city"
- True: "Paris"
Precision = 0 (no correct words), Recall = 0 (missed true answer)
- Predicted: "city of Paris"
- True: "Paris"
Precision = 1/3, Recall = 1 (all true words found but extra words included)
Good QA models balance precision and recall to get high F1 scores.
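The three precision and recall examples above can be checked with a short sketch. The function name token_scores is hypothetical, and the only normalization assumed here is lowercasing and whitespace splitting; real evaluation scripts often normalize more aggressively.

```python
from collections import Counter
from typing import Tuple


def token_scores(prediction: str, truth: str) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) based on word overlap.

    Assumption: tokens are lowercase whitespace-separated words.
    """
    pred_tokens = prediction.lower().split()
    true_tokens = truth.lower().split()
    # Counter intersection counts each shared word at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for predicted in ["Paris", "the city", "city of Paris"]:
    p, r, f1 = token_scores(predicted, "Paris")
    print(f"{predicted!r}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

For "city of Paris" the sketch gives precision 1/3 and recall 1, which combine to an F1 of 0.5.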
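As a rough rule of thumb, good QA models reach Exact Match scores above 70% and F1 scores above 80% on standard datasets. Weak models score low on both (EM below 40%, F1 below 50%), meaning answers are often wrong or incomplete. The exact thresholds vary with the dataset and how hard its questions are.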
Example:
- Good: EM = 75%, F1 = 85% (answers mostly correct and complete)
- Bad: EM = 30%, F1 = 45% (answers often wrong or missing key info)
- Exact Match too strict: Unless answers are normalized first, small differences like punctuation or articles cause a zero score even if the answer is close.
- Ignoring partial credit: Relying only on EM misses partially correct answers; F1 helps here.
- Data leakage: Training on test questions falsely inflates scores.
- Overfitting: High training scores but low test scores mean the model memorizes answers rather than generalizing.
- Ambiguous questions: When several answers are correct, scoring against a single reference answer can penalize valid predictions.
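One common way to soften the ambiguous-question pitfall is to score a prediction against every acceptable reference answer and keep the best result. The sketch below assumes that approach; the names token_f1 and best_f1 are illustrative, and tokenization is plain lowercase whitespace splitting.

```python
from collections import Counter
from typing import List


def token_f1(prediction: str, truth: str) -> float:
    """Word-overlap F1 between one prediction and one reference answer."""
    pred_tokens = prediction.lower().split()
    true_tokens = truth.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, references: List[str]) -> float:
    """Score against each reference and keep the highest F1.

    Assumption: references is non-empty; max() raises ValueError otherwise.
    """
    return max(token_f1(prediction, ref) for ref in references)


# Illustrative data: two acceptable answers to "Who wrote Hamlet?"
references = ["Shakespeare", "William Shakespeare"]
print(best_f1("William Shakespeare", references))
print(token_f1("William Shakespeare", "Shakespeare"))
```

Scored against "Shakespeare" alone, the longer prediction loses precision; scored against both references, it receives full credit.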
Your QA model has 85% Exact Match but only 50% F1 score. Is it good? Why or why not?
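Answer: This result is not just unusual but inconsistent. Every exact match also scores an F1 of 1 on that example, so when both metrics are averaged over the same examples with the same normalization, F1 is always equal to or higher than EM. An F1 of 50% next to an EM of 85% therefore points to an error in the calculation, for example the two metrics being computed on different data or with different answer normalization. You should check the evaluation method before judging the model. Generally, both EM and F1 should be high for a good QA model.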
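The relationship between the two metrics can be checked on a small hand-made sample. This is a sketch under stated assumptions: the example pairs are invented for illustration, both metrics use the same lowercase whitespace tokenization, and the helper names are hypothetical.

```python
from collections import Counter


def em_score(prediction: str, truth: str) -> float:
    """1.0 if the lowercase word sequences are identical, else 0.0."""
    return float(prediction.lower().split() == truth.lower().split())


def f1_score(prediction: str, truth: str) -> float:
    """Word-overlap F1 using the same tokenization as em_score."""
    pred_tokens = prediction.lower().split()
    true_tokens = truth.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


# Illustrative (prediction, true answer) pairs, not real model output.
examples = [
    ("Paris", "Paris"),
    ("the city of Paris", "Paris"),
    ("London", "Paris"),
    ("William Shakespeare", "William Shakespeare"),
]

mean_em = sum(em_score(p, t) for p, t in examples) / len(examples)
mean_f1 = sum(f1_score(p, t) for p, t in examples) / len(examples)
print(f"EM = {mean_em:.2f}, F1 = {mean_f1:.2f}")

# Each exact match contributes F1 = 1, so the averages must satisfy this.
assert mean_f1 >= mean_em
```

If a similar check fails on a real evaluation run, the two metrics were most likely computed on different examples or with different normalization.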
Practice
Solution
Step 1: Understand the QA pipeline purpose
The QA pipeline is designed to find answers from a given text based on a question.
Step 2: Match function to options
Only "It finds the answer to the question from the given context." describes finding an answer from the context, which is the pipeline's main job.
Final Answer:
It finds the answer to the question from the given context. -> Option C
Quick Check:
QA pipeline = find answer from context [OK]
- Confusing QA with translation or summarization
- Thinking it generates new questions
- Assuming it works without context
Solution
Step 1: Recall correct import and pipeline creation
The correct import is "from transformers import pipeline", followed by a call to pipeline('question-answering').
Step 2: Check each option syntax
Only "from transformers import pipeline; qa = pipeline('question-answering')" matches the correct syntax and function call.
Final Answer:
from transformers import pipeline; qa = pipeline('question-answering') -> Option D
Quick Check:
Correct import and pipeline call = from transformers import pipeline; qa = pipeline('question-answering') [OK]
- Wrong import statement
- Incorrect pipeline argument
- Using non-existent classes or functions
from transformers import pipeline
qa = pipeline('question-answering')
result = qa(question='Where is the Eiffel Tower?', context='The Eiffel Tower is in Paris.')
print(result['answer'])
Solution
Step 1: Understand the question and context
The question asks for the location of the Eiffel Tower, and the context states it is in Paris.
Step 2: Predict the pipeline answer output
The pipeline extracts the answer span from the context, which is 'Paris'.
Final Answer:
Paris -> Option B
Quick Check:
Answer extracted = Paris [OK]
- Choosing the full phrase instead of the exact answer
- Confusing question with context text
- Expecting the pipeline to generate new text
from transformers import pipeline
qa = pipeline('question-answering')
result = qa(question='Who wrote Hamlet?', text='Hamlet was written by Shakespeare.')
print(result['answer'])
Solution
Step 1: Check pipeline argument names
The QA pipeline expects 'question' and 'context' as arguments, not 'text'.
Step 2: Verify other parts of the code
The pipeline name and import are correct; accessing result['answer'] is valid.
Final Answer:
The argument 'text' should be 'context'. -> Option A
Quick Check:
Use 'context' argument for QA pipeline [OK]
- Using 'text' instead of 'context'
- Changing pipeline name incorrectly
- Wrong result access syntax
Solution
Step 1: Understand pipeline input limits
QA pipelines work best on one context at a time; long concatenated text may reduce accuracy.
Step 2: Evaluate options for multiple documents
Running QA on each document separately and selecting the best answer is effective and practical.
Final Answer:
Run the QA pipeline separately on each document and pick the answer with the highest score. -> Option A
Quick Check:
Separate runs + best score = best multi-doc QA [OK]
- Concatenating all documents causing context overflow
- Ignoring documents except first
- Unnecessarily retraining models
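The per-document strategy from the last solution can be sketched as follows. The helper best_answer and the stand-in fake_qa are hypothetical names introduced for illustration; fake_qa only mimics the 'answer' and 'score' keys of a pipeline result so the example runs without downloading a model. With the real library, the object returned by pipeline('question-answering') could be passed in place of fake_qa.

```python
from typing import Callable, Dict, List


def best_answer(qa: Callable[..., Dict], question: str, documents: List[str]) -> Dict:
    """Run qa once per document and return the highest-scoring result.

    Assumption: qa accepts question= and context= keyword arguments and
    returns a dict with at least 'answer' and 'score' keys.
    """
    results = []
    for index, doc in enumerate(documents):
        result = dict(qa(question=question, context=doc))
        result["doc_index"] = index  # remember which document answered
        results.append(result)
    return max(results, key=lambda r: r["score"])


def fake_qa(question: str, context: str) -> Dict:
    """Hypothetical stand-in for a real QA pipeline (offline demo only)."""
    found = "Paris" in context
    return {
        "answer": "Paris" if found else context.split()[0],
        "score": 0.9 if found else 0.1,
    }


documents = ["The Louvre is a museum.", "The Eiffel Tower is in Paris."]
best = best_answer(fake_qa, "Where is the Eiffel Tower?", documents)
print(best["answer"], best["doc_index"])
```

Keeping the document index alongside the answer makes it possible to show which source the winning answer came from.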
