
NER with spaCy in NLP - Model Metrics & Evaluation

Which metrics matter for NER with spaCy, and why

In Named Entity Recognition (NER), the goal is to find spans of text such as names, places, or dates and label them correctly. The key metrics are Precision, Recall, and F1-score.

Precision tells us how many of the entities the model found are actually correct. This matters because we don't want to label wrong words as entities.

Recall tells us how many of the real entities the model found. This matters because missing important entities means the model is incomplete.

F1-score balances precision and recall. It gives one number to see how well the model does overall.

We use these metrics because NER is about both finding entities and labeling them correctly.

Confusion matrix for NER (simplified)
          | Predicted Entity | Predicted Non-Entity
    ------|------------------|---------------------
    True  |        TP        |          FN         
    Entity| (correct entity) | (missed entity)     
    ------|------------------|---------------------
    True  |        FP        |          TN         
    Non-  | (wrong entity)   | (correct non-entity) 
    Entity|                  |                     
    

TP = True Positives: entities correctly found.
FP = False Positives: wrong words labeled as entities.
FN = False Negatives: real entities missed.
TN = True Negatives: non-entities correctly ignored.

Precision vs Recall tradeoff with examples

If the model has high precision but low recall, it labels entities carefully but misses many real ones. For example, a medical NER system that tags only disease names it is very confident about and misses rare diseases.

If the model has high recall but low precision, it finds most entities but also labels many wrong words. For example, a news NER system that tags many words as people but includes many mistakes.

For NER, a good balance (high F1-score) is important because we want to find most entities and be correct.

What good vs bad metric values look like for NER

Good NER model: Precision, Recall, and F1-score all above roughly 85% on a held-out test set. This means it finds most entities and labels them correctly.

Bad NER model: Precision or Recall below 50%. This means it either misses many entities or makes many wrong labels.

Example: Precision=90%, Recall=40% means many entities missed (bad recall). Precision=40%, Recall=90% means many wrong labels (bad precision).
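Plugging those two lopsided models into the F1 formula shows why a single balanced number is useful: both score the same mediocre F1, even though they fail in opposite ways.

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

print(round(f1(0.90, 0.40), 3))  # high precision, low recall -> 0.554
print(round(f1(0.40, 0.90), 3))  # high recall, low precision -> 0.554
print(round(f1(0.85, 0.85), 3))  # balanced model -> 0.85
```

Because F1 is a harmonic mean, it is dragged down sharply by whichever of the two values is worse, so a lopsided model cannot hide behind one strong number.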

Common pitfalls in NER metrics
  • Ignoring entity boundaries: Under strict evaluation, a prediction counts as correct only if both the full entity span and its label match exactly; partial matches count as errors.
  • Data leakage: Testing on data the model saw during training inflates metrics falsely.
  • Imbalanced entities: Some entity types may be rare, so overall metrics can hide poor performance on rare types.
  • Overfitting: Very high training scores but low test scores mean the model memorizes instead of learning.
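spaCy's built-in Scorer reports exactly these entity-level numbers, including per-type scores that expose the imbalanced-entities pitfall above. A minimal sketch, assuming spaCy 3.x is installed; the text and entity spans here are made up for the demo, so no trained model needs to be downloaded:

```python
import spacy
from spacy.tokens import Span
from spacy.training import Example
from spacy.scorer import Scorer

nlp = spacy.blank("en")  # tokenizer only; we set entities by hand below

text = "Apple opened a store in Paris"
pred = nlp(text)
# Pretend the model predicted only "Apple" (token 0) as ORG
pred.ents = [Span(pred, 0, 1, label="ORG")]

ref = nlp(text)
# Gold annotations: "Apple" is ORG and "Paris" (token 5) is GPE
ref.ents = [Span(ref, 0, 1, label="ORG"), Span(ref, 5, 6, label="GPE")]

scores = Scorer().score([Example(pred, ref)])
print(scores["ents_p"], scores["ents_r"], scores["ents_f"])
print(scores["ents_per_type"])  # per-label scores reveal the missed GPE
```

Here the overall precision is perfect (the one prediction was right) but recall is only 0.5, and the `ents_per_type` breakdown shows GPE scoring zero, which the averaged numbers alone would hide.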

Self-check: Your model has 98% accuracy but 12% recall on entities. Is it good?

No, this model is not good for NER. The high accuracy is misleading because most words are not entities, so the model guesses "non-entity" most of the time and is right.

The very low recall (12%) means it misses almost all real entities. This defeats the purpose of NER, which is to find entities.

Better metrics to trust are precision, recall, and F1-score on the entity class, not overall accuracy.
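A quick calculation with made-up token counts shows how easily class imbalance produces this effect:

```python
# Assumed counts for illustration: a heavily imbalanced NER dataset
total_tokens = 10_000
entity_tokens = 50            # only 0.5% of tokens belong to entities
tp = 6                        # entity tokens the model caught
fn = entity_tokens - tp       # 44 entities missed
fp = 0                        # model almost never predicts "entity"
tn = total_tokens - entity_tokens - fp

accuracy = (tp + tn) / total_tokens
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.2%}, recall={recall:.0%}")
# accuracy=99.56%, recall=12%
```

Predicting "non-entity" for nearly every token is right 99%+ of the time, yet the model is useless for its actual job of finding entities.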

Key Result
For NER with spaCy, the F1-score, which balances precision and recall, is the best single summary of model quality.