
Named entity recognition in NLP - Model Metrics & Evaluation

Which metric matters for Named Entity Recognition and WHY

Named Entity Recognition (NER) finds mentions of people, places, organizations, dates, and similar entities in text. We want to know how well the model finds these entities and how many of the ones it finds are actually correct. Precision tells us what fraction of the predicted entities are correct. Recall tells us what fraction of the real entities in the text the model found. The F1 score balances precision and recall in a single number. These metrics matter because NER needs to find as many true entities as possible without producing too many false ones.

Confusion Matrix for Named Entity Recognition

For NER, we look at each entity found as either correct or wrong. Here is a simple confusion matrix example for one entity type (e.g., Person):

|                   | Predicted Entity         | Predicted Not Entity     |
|-------------------|--------------------------|--------------------------|
| Actual Entity     | True Positive (TP) = 80  | False Negative (FN) = 20 |
| Actual Not Entity | False Positive (FP) = 10 | True Negative (TN) = 890 |

Total samples = TP + FP + TN + FN = 80 + 10 + 890 + 20 = 1000 tokens.

From this, Precision = 80 / (80 + 10) = 0.89, Recall = 80 / (80 + 20) = 0.80, F1 = 2 * (0.89 * 0.80) / (0.89 + 0.80) ≈ 0.84.
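The arithmetic above can be sketched as a small helper function; the counts are taken from the confusion matrix, and the function name is just illustrative:

```python
def ner_metrics(tp, fp, fn):
    """Compute precision, recall, and F1 from entity-level counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Counts from the confusion matrix above
p, r, f1 = ner_metrics(tp=80, fp=10, fn=20)
print(f"Precision={p:.2f}, Recall={r:.2f}, F1={f1:.2f}")
# Precision=0.89, Recall=0.80, F1=0.84
```

The guard clauses return 0.0 when a denominator would be zero, which happens when a model predicts no entities at all.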

Precision vs Recall Tradeoff with Examples

In NER, if you want to avoid missing any important names (high recall), you might accept some wrong names (lower precision). For example, in medical records, missing a disease name is bad, so recall is key.

On the other hand, if you want to avoid wrong names (high precision), you might miss some names (lower recall). For example, in legal documents, wrongly tagging a name can cause confusion, so precision is important.

Balancing precision and recall with the F1 score helps find a good middle ground.
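When one side of the tradeoff matters more, the F-beta score generalizes F1: beta > 1 weights recall more heavily (like the medical example), beta < 1 weights precision (like the legal example). A minimal sketch, reusing the precision and recall values from the confusion matrix above:

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.89, 0.80
print(f"F1   = {f_beta(p, r, 1.0):.2f}")  # balanced
print(f"F2   = {f_beta(p, r, 2.0):.2f}")  # recall-heavy (medical records)
print(f"F0.5 = {f_beta(p, r, 0.5):.2f}")  # precision-heavy (legal documents)
```

Because precision (0.89) is higher than recall (0.80) here, F0.5 comes out above F1, and F2 comes out below it.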

What Good vs Bad Metric Values Look Like for NER

Good: Precision and recall both above 0.85 means the model finds most names correctly and misses few. F1 score above 0.85 shows balanced performance.

Bad: Precision below 0.5 means many wrong names are found. Recall below 0.5 means many real names are missed. F1 below 0.5 means poor overall performance.

Example: Precision=0.9, Recall=0.9, F1=0.9 is good. Precision=0.3, Recall=0.7, F1=0.42 is bad.

Common Metrics Pitfalls in NER
  • Accuracy paradox: Most tokens are not entities, so accuracy can be high even if the model never finds entities.
  • Data leakage: If test data is too similar to training data, metrics look better but model may fail on new text.
  • Overfitting: Very high training metrics but low test metrics means the model memorizes training names but cannot generalize.
  • Ignoring entity boundaries: Partial matches count as wrong, so exact match metrics are stricter but more meaningful.
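The last pitfall, entity boundaries, can be made concrete with exact-match scoring: a predicted entity counts as correct only if its start, end, and type all agree with a gold entity. A minimal sketch with hypothetical spans:

```python
def exact_match_counts(gold_spans, pred_spans):
    """Entity-level exact match: a prediction counts as a true positive
    only if boundaries AND type agree exactly with a gold span."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)   # exact agreement
    fp = len(pred - gold)   # predicted spans with no exact gold match
    fn = len(gold - pred)   # gold spans the model missed or mis-bounded
    return tp, fp, fn

# Hypothetical spans as (token_start, token_end, type)
gold = [(0, 2, "PER"), (5, 7, "LOC")]
pred = [(0, 2, "PER"), (5, 6, "LOC")]  # second span has a boundary error
tp, fp, fn = exact_match_counts(gold, pred)
# tp=1, fp=1, fn=1: the partial overlap counts as both a FP and a FN
```

Note how one boundary error is penalized twice under exact match, which is why these metrics are stricter than token-level ones.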
Self Check: Your model has 98% accuracy but 12% recall on entities. Is it good?

No, this is not good for NER. The high accuracy is misleading because most text is not entities. The very low recall (12%) means the model misses almost all real names. It finds very few entities, so it is not useful for finding names.
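The accuracy paradox behind this self-check is easy to reproduce with a toy example (the corpus size and entity counts below are invented for illustration):

```python
# Toy corpus: 1000 tokens, only 50 are entity tokens ("ENT"), rest "O".
gold = ["ENT"] * 50 + ["O"] * 950
pred = ["O"] * 1000  # a lazy model that never predicts any entity

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
tp = sum(g == p == "ENT" for g, p in zip(gold, pred))
fn = sum(g == "ENT" and p == "O" for g, p in zip(gold, pred))
recall = tp / (tp + fn)
print(f"Accuracy={accuracy:.0%}, Recall={recall:.0%}")
# Accuracy=95%, Recall=0%
```

The model is useless for NER, yet token accuracy is 95% simply because most tokens are not entities.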

Key Result
For Named Entity Recognition, balanced precision and recall (measured by F1 score) best show model quality because they capture correct and missed entity detections.