Why spaCy Is Production-Grade NLP - Why Metrics Matter
For spaCy as a production-grade NLP tool, the key metrics are processing speed, accuracy on language tasks (such as named entity recognition), and robustness. These metrics matter because real-world applications need models that are fast enough to handle many requests, accurate enough to interpret text correctly, and robust enough to cope with varied inputs.
Named Entity Recognition Example Confusion Matrix (rows = true labels, columns = predicted labels):

                  Predicted
              PER  LOC  ORG    O
    True PER   85    5    3    7
         LOC    4   90    2    4
         ORG    6    3   88    3
         O      5    4    2   89
For each class:
TP = correctly identified entities (the diagonal cell)
FP = tokens of other classes predicted as this class (off-diagonal cells in the class's predicted column)
FN = tokens of this class predicted as something else (off-diagonal cells in the class's true row)
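The definitions above can be turned directly into code. This sketch recomputes per-class precision and recall from the confusion matrix shown, using plain Python (no spaCy required):

```python
# Per-class precision/recall from the confusion matrix above
# (rows = true labels, columns = predicted labels).
labels = ["PER", "LOC", "ORG", "O"]
matrix = [
    [85, 5, 3, 7],   # true PER
    [4, 90, 2, 4],   # true LOC
    [6, 3, 88, 3],   # true ORG
    [5, 4, 2, 89],   # true O
]

def precision_recall(matrix, i):
    tp = matrix[i][i]                        # diagonal cell
    fp = sum(row[i] for row in matrix) - tp  # rest of predicted column i
    fn = sum(matrix[i]) - tp                 # rest of true row i
    return tp / (tp + fp), tp / (tp + fn)

for i, label in enumerate(labels):
    p, r = precision_recall(matrix, i)
    print(f"{label}: precision={p:.2f} recall={r:.2f}")
# For PER: TP=85, FP=4+6+5=15, FN=5+3+7=15,
# so precision = recall = 85/100 = 0.85.
```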
In spaCy's NLP tasks, precision is the fraction of predicted entities that are correct (TP / (TP + FP)), and recall is the fraction of true entities that were found (TP / (TP + FN)).
For example, in a chatbot, high precision avoids wrong answers (it should not label "New York" as a person when it is a location), while high recall ensures the bot catches all the important information.
Improving recall often lowers precision, and vice versa; spaCy's trained pipelines are tuned to balance the two for production use.
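The trade-off is easy to see with a confidence threshold: keeping only high-confidence entity predictions raises precision but drops recall. A minimal sketch with a hypothetical set of scored predictions (scores and gold labels are made up for illustration):

```python
# Hypothetical entity predictions: (model confidence, is it really an entity?)
predictions = [
    (0.95, True), (0.90, True), (0.85, False), (0.80, True),
    (0.60, True), (0.55, False), (0.40, True), (0.30, False),
]
total_true = sum(1 for _, gold in predictions if gold)  # 5 real entities

def precision_recall_at(threshold):
    # Keep only predictions at or above the confidence threshold.
    kept = [gold for score, gold in predictions if score >= threshold]
    tp = sum(kept)
    precision = tp / len(kept) if kept else 1.0
    recall = tp / total_true
    return precision, recall

for t in (0.9, 0.5, 0.2):
    p, r = precision_recall_at(t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

Lowering the threshold from 0.9 to 0.2 moves this toy model from high precision / low recall toward full recall at lower precision.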
Good: precision and recall above 85% on key NLP tasks, throughput of hundreds of texts per second, and stable results on unseen data.
Bad: precision or recall below 60%, processing too slow to meet latency requirements, or frequent crashes/errors on real inputs.
- Accuracy paradox: High accuracy can be misleading if data is imbalanced (e.g., many non-entities).
- Data leakage: Testing on data seen during training inflates metrics falsely.
- Overfitting indicators: Very high training accuracy but low test accuracy means poor generalization.
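The accuracy paradox from the list above can be demonstrated with a hypothetical, heavily imbalanced token stream: a degenerate model that never predicts an entity still scores high accuracy while its recall is zero.

```python
# Hypothetical imbalanced data: 980 non-entity tokens ("O"), 20 medical terms.
gold = ["O"] * 980 + ["MED"] * 20

# A degenerate "model" that never predicts an entity.
pred = ["O"] * 1000

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
tp = sum(g == p == "MED" for g, p in zip(gold, pred))
recall = tp / gold.count("MED")

print(f"accuracy={accuracy:.0%} recall={recall:.0%}")
# 98% accuracy, yet the model finds none of the medical terms.
```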
Your spaCy model has 98% accuracy but only 12% recall on detecting medical terms. Is it suitable for production? Why or why not?
Answer: No. Low recall means it misses most medical terms, which is unacceptable in healthcare. High accuracy alone is misleading when most tokens are non-medical.
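The question's figures are easy to reproduce arithmetically. One hypothetical corpus that yields exactly these numbers: 1100 tokens, 25 of them medical terms, with the model finding only 3 (and, for simplicity, making no false positives):

```python
# Hypothetical corpus reproducing 98% accuracy with 12% recall.
total_tokens = 1100
medical_terms = 25
found = 3  # model detects only 3 of the 25 terms; no false positives assumed

recall = found / medical_terms                     # 3 / 25 = 12%
errors = medical_terms - found                     # 22 missed terms
accuracy = (total_tokens - errors) / total_tokens  # 1078 / 1100 = 98%

print(f"accuracy={accuracy:.0%} recall={recall:.0%}")
```

Because medical terms make up only about 2% of the tokens, missing almost all of them barely dents accuracy, which is exactly why recall must be checked separately.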