Lemmatization in spaCy in NLP - Model Metrics & Evaluation
Lemmatization is the task of reducing words to their base (dictionary) form, or lemma. The key metric is accuracy: the fraction of processed words whose lemmas are produced correctly. This matters because correct base forms support many downstream language tasks, such as search, translation, and text understanding.
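As a minimal sketch, accuracy can be computed by comparing predicted lemmas against a gold standard. The word/lemma pairs below are illustrative; in practice the predictions would come from spaCy (each token's `lemma_` attribute).

```python
# Sketch: word-level lemmatization accuracy against a gold standard.
# The gold and predicted lists are made-up examples; real predictions
# would be collected from spaCy tokens via token.lemma_.
gold      = ["run", "be", "mouse", "better", "run"]
predicted = ["run", "be", "mouse", "well",   "running"]

correct = sum(g == p for g, p in zip(gold, predicted))
accuracy = correct / len(gold)
print(f"Accuracy: {accuracy:.0%}")  # 3 of 5 lemmas match -> 60%
```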
For lemmatization, a confusion matrix can show how many words were correctly lemmatized (True Positives) versus incorrectly lemmatized (False Positives and False Negatives). For example:
|                  | Predicted Correct | Predicted Incorrect |
|------------------|-------------------|---------------------|
| Actual Correct   | TP = 85           | FN = 15             |
| Actual Incorrect | FP = 10           | TN = 90             |
Here, TP counts words correctly lemmatized, FP counts words the model changed incorrectly, FN counts words whose correct lemmas the model missed, and TN counts words correctly left unchanged.
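Plugging the counts from the table in, the standard metrics work out as follows:

```python
# Metrics derived from the confusion matrix above (TP=85, FN=15, FP=10, TN=90).
TP, FN, FP, TN = 85, 15, 10, 90

accuracy  = (TP + TN) / (TP + TN + FP + FN)  # 175 / 200
precision = TP / (TP + FP)                   # 85 / 95
recall    = TP / (TP + FN)                   # 85 / 100

print(f"Accuracy:  {accuracy:.3f}")   # 0.875
print(f"Precision: {precision:.3f}")  # 0.895
print(f"Recall:    {recall:.3f}")     # 0.850
```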
Precision tells us how many of the words we labeled as correct base forms really are correct. Recall tells us how many of the actual base forms we found.
For example, if we want to avoid wrong base forms (high precision), we might miss some correct ones (lower recall). If we want to find all base forms (high recall), we might include some wrong ones (lower precision).
In lemmatization, high precision is usually preferred, since wrong lemmas can distort meaning, but recall should not drop so low that too many words are left unlemmatized.
Good: Accuracy above 90%, Precision and Recall balanced above 85%. This means most words are correctly lemmatized and few mistakes happen.
Bad: Accuracy below 70%, Precision or Recall very low (below 50%). This means many words are wrongly lemmatized or many base forms are missed, hurting downstream tasks.
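The rough thresholds above can be sketched as a simple check. The cutoffs are the ones stated in the text, not an official spaCy standard, and the function name is illustrative.

```python
# Sketch: classify a model's metrics using the rough thresholds above.
# These cutoffs come from the guideline in the text, not from spaCy.
def metrics_verdict(accuracy: float, precision: float, recall: float) -> str:
    if accuracy > 0.90 and precision > 0.85 and recall > 0.85:
        return "good"
    if accuracy < 0.70 or precision < 0.50 or recall < 0.50:
        return "bad"
    return "borderline"

print(metrics_verdict(0.92, 0.88, 0.87))  # good
print(metrics_verdict(0.98, 0.90, 0.12))  # bad (recall far too low)
```

Note that the second call mirrors the interview question below: very high accuracy does not save a model whose recall has collapsed.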
- Ignoring context: Some words need sentence context to lemmatize correctly. Metrics may look good on simple words but fail on complex sentences.
- Data leakage: Testing on words seen during training inflates accuracy.
- Overfitting: Model memorizes common words but fails on new words, causing poor real-world performance.
- Accuracy paradox: High accuracy can happen if many words don't need lemmatization, hiding poor performance on actual changes.
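The accuracy paradox is easy to demonstrate with a toy example: a "lemmatizer" that returns every word unchanged still scores well overall when most words are already in base form, while getting every word that actually changes wrong. The word/lemma pairs here are illustrative.

```python
# Toy illustration of the accuracy paradox: an identity "lemmatizer"
# that never changes any word. Most words below need no change.
pairs = [
    ("the", "the"), ("cat", "cat"), ("on", "on"), ("a", "a"),
    ("mat", "mat"), ("dog", "dog"), ("sun", "sun"), ("red", "red"),
    ("mice", "mouse"), ("sat", "sit"),
]

predictions = [word for word, _ in pairs]  # identity: leave every word as-is

overall = sum(p == g for (_, g), p in zip(pairs, predictions)) / len(pairs)

# Accuracy only on the words whose lemma differs from the surface form:
hard = [(w, g) for w, g in pairs if w != g]
hard_acc = sum(w == g for w, g in hard) / len(hard)

print(f"Overall accuracy: {overall:.0%}")             # 80%
print(f"Accuracy on words that change: {hard_acc:.0%}")  # 0%
```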
Your lemmatization model has 98% accuracy but only 12% recall on rare verb forms. Is it good for production? Why or why not?
Answer: No, it is not good. The high accuracy likely comes from many words that don't change, but the very low recall on rare verbs means the model misses most of these important cases. This hurts tasks relying on correct base forms of verbs.