Information extraction patterns in NLP - Model Metrics & Evaluation

Information extraction (IE) aims to find specific pieces of information in text, such as names or dates. The key metrics are precision and recall. Precision tells us what fraction of extracted items are actually correct; recall tells us what fraction of all correct items we found. We want both to be high, but one often matters more depending on the task.
|            | Predicted Yes  | Predicted No   |
|------------|----------------|----------------|
| Actual Yes | True Positive  | False Negative |
| Actual No  | False Positive | True Negative  |
TP = Correctly extracted info
FP = Extracted info that is wrong
FN = Missed info that should be extracted
TN = Correctly ignored non-info
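The counts above can be computed directly by comparing an extractor's output against gold annotations. A minimal sketch, using made-up entity sets (note that TN is rarely meaningful in span extraction, since almost every span of text is a "correctly ignored non-entity"):

```python
# Made-up illustrative data: gold entities vs. what the model extracted.
gold = {"Alice", "Bob", "Paris"}        # entities that should be extracted
predicted = {"Alice", "Paris", "IBM"}   # entities the model extracted

tp = len(gold & predicted)   # correctly extracted
fp = len(predicted - gold)   # extracted but wrong
fn = len(gold - predicted)   # missed

print(tp, fp, fn)  # 2 1 1
```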
If wrong information in the output is costly, optimize for high precision. For example, a legal document extractor must not introduce false facts.
If missing information is costly, optimize for high recall, even if some extractions are wrong. For example, a news aggregator wants to catch every name mentioned, even at the cost of some mistakes.
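In practice this trade-off is often controlled by a confidence threshold on the extractor's output. A hypothetical sketch (the entities, scores, and correctness labels are invented, and the gold set is assumed to be exactly the correct candidates):

```python
# Hypothetical extractor output: (entity, confidence, is_correct).
scored = [
    ("Alice", 0.95, True), ("Bob", 0.90, True), ("IBM", 0.60, False),
    ("Paris", 0.55, True), ("2021", 0.40, False), ("Carol", 0.35, True),
]
total_correct = sum(ok for _, _, ok in scored)  # 4 gold entities

def p_r(threshold):
    kept = [ok for _, conf, ok in scored if conf >= threshold]
    tp = sum(kept)
    precision = tp / len(kept) if kept else 0.0
    recall = tp / total_correct
    return precision, recall

print(p_r(0.3))  # low threshold: keep everything -> perfect recall, more errors
print(p_r(0.8))  # high threshold: only confident picks -> perfect precision, misses half
```

Raising the threshold typically raises precision and lowers recall; where to set it depends on which error is more costly for the task.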
Balancing both is key. The F1 score helps measure this balance.
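The F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R), which is high only when both are high. A small sketch, checked against the example values used later in this section:

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall; define 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.9, 0.85), 3))  # 0.874
print(round(f1(0.4, 0.3), 3))   # 0.343
```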
Good: Precision and recall above 0.8 mean that most extracted info is correct and most of the available info is found.
Bad: Precision below 0.5 means many extractions are wrong; recall below 0.5 means much of the info is missed.
Example: Precision=0.9, Recall=0.85 is good. Precision=0.4, Recall=0.3 is bad.
- Accuracy paradox: High accuracy can be misleading if most text has no info to extract.
- Data leakage: Testing on data too similar to training inflates metrics.
- Overfitting: Model extracts perfectly on training but fails on new text.
- Ignoring class imbalance: info to extract is rare relative to the surrounding text, so accuracy is dominated by true negatives; report precision and recall instead.
Your IE model has 98% accuracy but only 12% recall on extracting names. Is it good?
Answer: No. Despite the high accuracy, the model misses 88% of the names (recall of 0.12), so it is not useful for the task. The accuracy is inflated by the many tokens that are not names at all.
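One set of made-up counts that reproduces the quiz numbers (the token totals are invented for illustration):

```python
# 3000 tokens, of which only 50 are names (rare positive class).
# The model finds 6 names, misses 44, and makes 16 false positives.
tp, fn, fp = 6, 44, 16
tn = 3000 - tp - fn - fp  # 2934 tokens correctly left alone

accuracy = (tp + tn) / 3000
recall = tp / (tp + fn)
print(accuracy, recall)  # 0.98 0.12
```

Because true negatives dominate, accuracy stays high no matter how many names the model misses; recall exposes the failure.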