
Named Entity Recognition basics in ML Python - Model Metrics & Evaluation

Which metric matters for Named Entity Recognition and WHY

Named Entity Recognition (NER) finds names like people, places, or dates in text. We want to know how well the model finds these names correctly.

Precision tells us how many of the found names are actually correct. High precision means few wrong names.

Recall tells us how many of the real names the model found. High recall means few missed names.

The F1 score balances precision and recall. It is the standard single number for overall NER quality.

We use these metrics because NER is about finding exact names, so both missed names and wrong names matter.
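These definitions can be sketched as small Python functions (an illustrative sketch, not tied to any particular NER library):

```python
def precision(tp, fp):
    """Fraction of predicted entities that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Fraction of real entities the model found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0
```

The zero checks guard against the edge case where the model predicts nothing at all.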

Confusion matrix for Named Entity Recognition
    |----------------------------------|
    |            |      Predicted      |
    | Actual     | Entity | Not Entity |
    |----------------------------------|
    | Entity     |   TP   |     FN     |
    | Not Entity |   FP   |     TN     |
    |----------------------------------|

    TP = Correctly found entities
    FP = Wrongly found entities (false alarms)
    FN = Missed entities
    TN = Correctly ignored non-entities
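As a simple sketch, the four counts can be computed from aligned per-token gold and predicted labels. Here we assume just two labels, "ENT" and "O"; real NER tagging schemes such as BIO are more detailed:

```python
def confusion_counts(gold, pred):
    """Count TP, FP, FN, TN over aligned token label sequences."""
    tp = sum(g == "ENT" and p == "ENT" for g, p in zip(gold, pred))
    fp = sum(g == "O" and p == "ENT" for g, p in zip(gold, pred))
    fn = sum(g == "ENT" and p == "O" for g, p in zip(gold, pred))
    tn = sum(g == "O" and p == "O" for g, p in zip(gold, pred))
    return tp, fp, fn, tn

gold = ["ENT", "O", "O", "ENT", "O"]
pred = ["ENT", "ENT", "O", "O", "O"]
# One correct entity, one false alarm, one miss, two correct non-entities
```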
    

Example: if the model finds 80 correct names (TP), misses 20 names (FN), and wrongly tags 10 words as names (FP), then precision = 80/(80+10) ≈ 0.89, recall = 80/(80+20) = 0.80, and F1 = 2 × 0.89 × 0.80 / (0.89 + 0.80) ≈ 0.84.
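Plugging the example counts in as a quick arithmetic check:

```python
tp, fp, fn = 80, 10, 20
precision = tp / (tp + fp)                          # 80/90 ≈ 0.889
recall = tp / (tp + fn)                             # 80/100 = 0.8
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.842
```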

Precision vs Recall tradeoff with examples

Imagine an NER model for medical records:

  • High precision, low recall: Model finds only very sure names, so few wrong names but misses many real ones. Good if wrong names cause big problems.
  • High recall, low precision: Model finds almost all names but also many wrong ones. Good if missing a name is worse than extra false names.

Choosing depends on what matters more: avoiding false names or missing real names.
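One common way to move along this tradeoff is a confidence threshold on the model's predictions. A toy sketch, where the scores, threshold values, and entity counts are all made-up assumptions for illustration:

```python
# Each predicted entity: (confidence score, whether it is actually correct)
predictions = [(0.95, True), (0.9, True), (0.8, False),
               (0.6, True), (0.5, False), (0.3, True)]
total_real_entities = 5  # assume one real entity was never predicted at all

def metrics_at(threshold):
    """Precision and recall if we keep only predictions above the threshold."""
    kept = [correct for score, correct in predictions if score >= threshold]
    tp = sum(kept)
    precision = tp / len(kept) if kept else 0.0
    recall = tp / total_real_entities
    return precision, recall

# Strict threshold: higher precision, lower recall -> metrics_at(0.85) == (1.0, 0.4)
# Loose threshold: lower precision, higher recall -> metrics_at(0.2) ≈ (0.667, 0.8)
```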

What good vs bad metric values look like for NER
  • Good: Precision and recall both above 0.85 (so F1 above 0.85) mean the model finds most names correctly and misses few.
  • Bad: Precision or recall below 0.5 means many wrong names or many missed names, making the model unreliable.
  • Very high precision but very low recall means the model is too strict and misses many names.
  • Very high recall but very low precision means the model is too loose and tags many wrong names.
Common pitfalls in NER metrics
  • Accuracy paradox: Most words are not names, so a model that tags nothing can have high accuracy but is useless.
  • Data leakage: If the test data contains the same sentences as the training data, metrics look better than the model will actually perform on new text.
  • Overfitting: Very high training metrics but low test metrics mean the model memorizes training names instead of generalizing.
  • Ignoring entity boundaries: Partial matches count as errors, so metrics must be computed on exact entity spans.
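The boundary pitfall can be shown by scoring entities as (start, end, type) spans: under exact matching, a prediction with the wrong boundary counts as both a false positive and a false negative. A minimal sketch (entity-level toolkits such as seqeval compute this from BIO tags; the spans below are invented for illustration):

```python
def span_scores(gold_spans, pred_spans):
    """Exact-span matching: a prediction counts only if start, end, and type all match."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)   # exact matches
    fp = len(pred - gold)   # predicted spans with no exact gold match
    fn = len(gold - pred)   # gold spans the model did not match exactly
    return tp, fp, fn

gold = [(0, 2, "PER"), (5, 7, "LOC")]
# "New York" predicted as just "York": wrong boundary, so no credit
pred = [(0, 2, "PER"), (6, 7, "LOC")]
# One exact match, one boundary error counted as both FP and FN
```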
Self-check question

Your NER model has 98% accuracy but only 12% recall on entities. Is it good for production?

Answer: No. The high accuracy is misleading because most words are not entities. The very low recall means the model misses almost all real names, so it is not useful.
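The mismatch is easy to reproduce with hypothetical token counts, chosen here so that accuracy comes out at 98% and entity recall at 12%, matching the question:

```python
total_tokens = 10_000
entity_tokens = 200                      # most tokens are not entities
tp, fn = 24, 176                         # model finds only 24 of 200 entity tokens
fp = 24                                  # a few false alarms
tn = total_tokens - entity_tokens - fp   # 9776 correctly ignored tokens

accuracy = (tp + tn) / total_tokens      # 0.98 -> looks great
recall = tp / (tp + fn)                  # 0.12 -> misses almost everything
```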

Key Result
F1 score balances precision and recall to best measure Named Entity Recognition quality.