For a custom question answering (QA) model, the key metrics are Exact Match (EM) and F1 score. Exact Match checks if the model's answer exactly matches the correct answer, which shows how precise the model is. F1 score measures the overlap between the predicted and true answers, balancing precision and recall. These metrics matter because QA answers can be short phrases or sentences, so partial matches are important to capture. High EM means the model is very accurate, and high F1 means it understands the answer well even if wording differs.
Custom QA model fine-tuning in NLP - Model Metrics & Evaluation
QA evaluation does not use a classic confusion matrix the way classification does. Instead, predicted answers are compared to true answers using token-level overlap.
True Answer: "Paris is the capital of France"
Predicted: "The capital of France is Paris"
Tokens matched (ignoring case and word order): Paris, is, the, capital, of, France
Tokens in true answer: 6
Tokens in predicted answer: 6
F1 = 2 * (Precision * Recall) / (Precision + Recall)
Precision = matched tokens / predicted tokens = 6/6 = 1.0
Recall = matched tokens / true tokens = 6/6 = 1.0
F1 = 2 * 1.0 * 1.0 / (1.0 + 1.0) = 1.0
Exact Match = 0 (answers not exactly the same)
Because token-level F1 ignores word order, this reordered answer earns a full F1 score even though Exact Match is 0.
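The calculation above can be sketched in code. This is a minimal illustration, assuming lowercase whitespace tokenization; the helper names tokenize, exact_match, and token_f1 are hypothetical, and standard QA evaluation scripts typically also strip punctuation and articles before comparing.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def exact_match(prediction: str, truth: str) -> int:
    """Return 1 if the normalized token sequences are identical, else 0."""
    return int(tokenize(prediction) == tokenize(truth))


def token_f1(prediction: str, truth: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = tokenize(prediction)
    true_tokens = tokenize(truth)
    # Multiset intersection: each shared token counts at most as often
    # as it appears in both answers.
    matched = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if matched == 0:
        return 0.0
    precision = matched / len(pred_tokens)
    recall = matched / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


truth = "Paris is the capital of France"
prediction = "The capital of France is Paris"
em = exact_match(prediction, truth)
f1 = token_f1(prediction, truth)
print(f"EM={em}, F1={f1:.3f}")
```

Token overlap ignores word order, so a reordered answer can receive a high F1 while Exact Match stays at 0.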
In QA, precision means how much of the predicted answer is correct, and recall means how much of the true answer the model found.
High precision, low recall: The model gives short answers that are always correct but miss some details. For example, answering "Paris" when the full answer is "Paris is the capital of France." This is safe but incomplete.
High recall, low precision: The model gives long answers that include the correct info but also extra wrong words. For example, "Paris is the capital of France and a big city in Europe." This covers the answer but adds noise.
Good QA models balance precision and recall to give answers that are correct and complete.
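The two cases above can be checked numerically. The sketch below is illustrative only: it assumes lowercase whitespace tokens with trailing punctuation stripped, and the helper names tokens and precision_recall are hypothetical.

```python
from collections import Counter


def tokens(text: str) -> list[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    stripped = (word.strip(".,!?") for word in text.lower().split())
    return [word for word in stripped if word]


def precision_recall(prediction: str, truth: str) -> tuple[float, float]:
    """Token-level precision and recall of a predicted answer."""
    pred, gold = tokens(prediction), tokens(truth)
    matched = sum((Counter(pred) & Counter(gold)).values())
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall


truth = "Paris is the capital of France."

# Short answer: every predicted token is correct, but most of the truth is missing.
short_p, short_r = precision_recall("Paris", truth)

# Long answer: the whole truth is covered, but half of the tokens are extra.
long_p, long_r = precision_recall(
    "Paris is the capital of France and a big city in Europe.", truth
)

print(f"short: precision={short_p:.2f}, recall={short_r:.2f}")
print(f"long:  precision={long_p:.2f}, recall={long_r:.2f}")
```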
Good QA model:
- Exact Match (EM) above 70% means the model often gets the answer exactly right.
- F1 score above 80% means the model captures most of the correct answer even if wording differs.
Bad QA model:
- EM below 40% means the model rarely matches answers exactly.
- F1 below 50% means the model misses many important words or adds wrong info.
Common pitfalls:
- Exact Match is too strict: It ignores partially correct answers that are still useful.
- Overfitting: Very high EM and F1 on training data but low on new questions means the model memorized answers, not learned to generalize.
- Data leakage: If test questions appear in training, metrics will be falsely high.
- Ignoring answer variability: Some questions have multiple correct answers; metrics must consider synonyms or paraphrases.
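One common way to handle several acceptable answers is to score the prediction against each reference and keep the best score. The sketch below assumes lowercase whitespace tokenization and uses hypothetical helper names (token_f1, best_f1) and made-up reference answers. It only covers alternatives that are explicitly listed, so unlisted synonyms or paraphrases would still be penalized.

```python
from collections import Counter


def token_f1(prediction: str, truth: str) -> float:
    """Token-level F1 over lowercase whitespace tokens."""
    pred, gold = prediction.lower().split(), truth.lower().split()
    matched = sum((Counter(pred) & Counter(gold)).values())
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, references: list[str]) -> float:
    """Score against every acceptable answer and keep the highest F1."""
    return max((token_f1(prediction, ref) for ref in references), default=0.0)


# Made-up example: three acceptable ways to write the same answer.
references = ["New York City", "NYC", "New York"]
score_exact = best_f1("new york", references)
score_partial = best_f1("York", references)
print(score_exact, round(score_partial, 3))
```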
Your custom QA model has 60% Exact Match but 85% F1 score on the test set. Is it good for production? Why or why not?
Answer: This means the model captures most of the answer well even if wording differs (high F1), which is great. But the lower EM shows it doesn't always get the exact answer right. Depending on your use case, this might be acceptable if partial matches are useful. However, if exact wording matters most, you may want to improve the model to raise EM before production.
Practice
Solution
Step 1: Understand fine-tuning goal
Fine-tuning adjusts a model to perform better on a specific task or dataset.
Step 2: Relate to QA models
For QA, fine-tuning helps the model answer questions accurately on your own data.
Final Answer:
To make the model answer questions better on your specific data -> Option B
Quick Check:
Fine-tuning = better task-specific answers [OK]
Common mistakes:
- Thinking fine-tuning changes model size
- Confusing fine-tuning with faster training
- Assuming it changes the model's language
Solution
Step 1: Identify required data components
QA models need questions, contexts (where answers are found), and answers to learn properly.
Step 2: Check options
Only the dataset with questions, contexts, and answers includes all three necessary parts for training.
Final Answer:
A dataset with questions, contexts, and answers -> Option A
Quick Check:
QA data = questions + contexts + answers [OK]
Common mistakes:
- Omitting context in the dataset
- Using unlabeled or random text
- Ignoring the answer field
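A single training record with all three parts might look like the sketch below. The field layout follows the SQuAD-style convention (answer text plus character offset into the context), the sample content is made up, and is_valid_example is a hypothetical helper, not part of any library.

```python
REQUIRED_FIELDS = ("question", "context", "answers")


def is_valid_example(example: dict) -> bool:
    """Check that a record has all fields and that each answer span matches the context."""
    if any(field not in example for field in REQUIRED_FIELDS):
        return False
    context = example["context"]
    texts = example["answers"]["text"]
    starts = example["answers"]["answer_start"]
    if not texts or len(texts) != len(starts):
        return False
    # Each answer must appear in the context at the stated character offset.
    return all(context[s:s + len(t)] == t for t, s in zip(texts, starts))


context = "France is a country in Europe. Its capital is Paris."
example = {
    "question": "What is the capital of France?",
    "context": context,
    "answers": {"text": ["Paris"], "answer_start": [context.index("Paris")]},
}

# A record without context cannot be used for extractive QA training.
no_context = {"question": example["question"], "answers": example["answers"]}

print(is_valid_example(example), is_valid_example(no_context))
```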
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(output_dir='./results', num_train_epochs=1)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)
metrics = trainer.train()
print(metrics.metrics['eval_accuracy'])
Solution
Step 1: Understand default metrics in Trainer
Trainer does not compute 'eval_accuracy' unless a compute_metrics function is provided. In addition, trainer.train() returns training statistics such as train_loss and train_runtime; evaluation metrics are returned by trainer.evaluate().
Step 2: Analyze printed output
Since no compute_metrics is defined, the 'eval_accuracy' key does not exist in metrics.metrics, so accessing it raises a KeyError.
Final Answer:
A KeyError because eval_accuracy is not computed by default -> Option D
Quick Check:
Default Trainer lacks eval_accuracy metric [OK]
Common mistakes:
- Assuming eval_accuracy is always computed
- Expecting a syntax error instead of missing metric
- Confusing training steps count with accuracy
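To get an accuracy value, a compute_metrics function has to be supplied. The sketch below assumes classification-style outputs (one row of scores per example and integer labels) to match the eval_accuracy key in the question. A real extractive QA setup would first need to turn start and end logits into answer strings. The SimpleNamespace object is a made-up stand-in for what Trainer passes in.

```python
from types import SimpleNamespace


def compute_metrics(eval_pred) -> dict:
    """Return accuracy from per-class scores and integer labels.

    Assumes eval_pred.predictions holds one row of scores per example and
    eval_pred.label_ids holds the matching class ids.
    """
    correct = 0
    for scores, label in zip(eval_pred.predictions, eval_pred.label_ids):
        scores = list(scores)
        predicted = scores.index(max(scores))  # index of the highest score
        correct += int(predicted == label)
    total = len(eval_pred.label_ids)
    return {"accuracy": correct / total if total else 0.0}


# Stand-in for the evaluation output (made-up values).
fake_eval = SimpleNamespace(
    predictions=[[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]],
    label_ids=[1, 0, 0],
)
result = compute_metrics(fake_eval)
print(result)
```

Passing compute_metrics=compute_metrics to Trainer and calling trainer.evaluate() should then report the value under the key eval_accuracy, since evaluation metric names are prefixed with "eval_" by default.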
ValueError: Expected input batch to have 3 elements (input_ids, attention_mask, token_type_ids). What is the most likely cause?
Solution
Step 1: Understand the error message
The error says the input batch is missing token_type_ids, which some QA models (for example, BERT-style models) require.
Step 2: Check dataset output
If the dataset's __getitem__ method does not return token_type_ids, the model input is incomplete, which causes this error.
Final Answer:
Your dataset does not return token_type_ids in __getitem__ -> Option C
Quick Check:
Missing token_type_ids in data causes input error [OK]
Common mistakes:
- Blaming TrainingArguments settings
- Assuming model architecture is wrong
- Thinking optimizer causes input shape errors
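A dataset that avoids this error returns all three fields from __getitem__. The class below is a hypothetical, framework-free sketch with made-up token ids. A real training dataset would usually return tensors and also include answer labels such as start and end positions.

```python
class QADataset:
    """Hypothetical map-style dataset wrapping pre-tokenized QA features."""

    REQUIRED_KEYS = ("input_ids", "attention_mask", "token_type_ids")

    def __init__(self, encodings: dict):
        # Fail early if the tokenizer output lacks a field the model expects.
        missing = [key for key in self.REQUIRED_KEYS if key not in encodings]
        if missing:
            raise KeyError(f"encodings are missing fields: {missing}")
        self.encodings = encodings

    def __len__(self) -> int:
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx: int) -> dict:
        # Return every required field, including token_type_ids.
        return {key: self.encodings[key][idx] for key in self.REQUIRED_KEYS}


# Made-up ids: segment 0 marks question tokens, segment 1 marks context tokens.
encodings = {
    "input_ids": [[101, 2054, 102, 3000, 102]],
    "attention_mask": [[1, 1, 1, 1, 1]],
    "token_type_ids": [[0, 0, 0, 1, 1]],
}
dataset = QADataset(encodings)
item = dataset[0]
print(sorted(item))
```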
Solution
Step 1: Identify overfitting risk factors
Small datasets can cause models to memorize instead of generalize, leading to overfitting.
Step 2: Choose strategies to reduce overfitting
Early stopping ends training once validation performance stops improving; a lower learning rate makes weight updates more gradual.
Final Answer:
Use early stopping and lower learning rate -> Option A
Quick Check:
Early stopping + low LR reduces overfitting [OK]
Common mistakes:
- Training too many epochs on small data
- Removing context which is essential
- Increasing batch size without adjusting learning rate
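The patience logic behind early stopping can be sketched in a few lines. The function name early_stopping_epoch and the loss values are made up for illustration. The transformers library also provides an EarlyStoppingCallback for use with Trainer, which this sketch does not attempt to reproduce.

```python
def early_stopping_epoch(eval_losses: list[float], patience: int = 2) -> tuple[int, float]:
    """Return (epoch index where training stops, best validation loss seen).

    eval_losses stands in for the validation loss measured after each epoch.
    Training stops once the loss has failed to improve for `patience` epochs in a row.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_loss
    # Ran through every epoch without triggering the stop condition.
    return len(eval_losses) - 1, best_loss


# Made-up validation losses: improvement stalls after the third epoch.
losses = [0.90, 0.70, 0.65, 0.66, 0.68, 0.71]
stopped_at, best = early_stopping_epoch(losses, patience=2)
print(stopped_at, best)
```

With these values training would stop at epoch index 4, two epochs after the best loss of 0.65, instead of continuing while validation loss keeps rising.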
