Model versioning tracks different versions of a machine learning model over time. The key metrics to watch are the ones that show how well each version performs: accuracy, precision, recall, and F1 score. These metrics tell us whether a new version is better or worse than the old one; without tracking them, we cannot know whether a change improved the model or degraded it.
Model versioning in ML Python - Model Metrics & Evaluation
Version 1 Confusion Matrix (200 samples):

                Predicted P   Predicted N
    Actual P        80            20
    Actual N        15            85

Version 2 Confusion Matrix (200 samples):

                Predicted P   Predicted N
    Actual P        90            10
    Actual N        20            80
These matrices let us compute and compare the metrics directly. Version 1: accuracy = 165/200 = 82.5%, precision = 80/95 ≈ 84.2%, recall = 80/100 = 80%. Version 2: accuracy = 170/200 = 85%, precision = 90/110 ≈ 81.8%, recall = 90/100 = 90%. Version 2 is more accurate and catches more positives, but gives up some precision to do it.
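A minimal sketch of how these metrics fall out of the matrices above, in plain Python with no external libraries:

```python
def metrics(tp, fn, fp, tn):
    """Compute standard classification metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # of predicted positives, how many were right
    recall = tp / (tp + fn)      # of actual positives, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Counts read off the two matrices above (rows = actual, columns = predicted).
v1 = metrics(tp=80, fn=20, fp=15, tn=85)
v2 = metrics(tp=90, fn=10, fp=20, tn=80)

print(v1)  # accuracy 0.825, precision ~0.842, recall 0.80
print(v2)  # accuracy 0.850, precision ~0.818, recall 0.90
```

Storing a dictionary like this alongside each model version makes later comparisons a matter of diffing two records.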
When comparing model versions, one version may have higher precision but lower recall while another has the opposite. For example, a spam-filter version might catch more spam (higher recall) but also flag more legitimate emails as spam (lower precision). Which version to keep depends on what matters more: avoiding false alarms or catching all spam.
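One common way to encode "what matters more" in a single number is an F-beta score, where beta > 1 weights recall more heavily than precision. A sketch with hypothetical precision/recall figures for two spam-filter versions (the numbers are illustrative, not from the text):

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical versions: A is cautious (high precision), B is aggressive (high recall).
a = f_beta(precision=0.95, recall=0.80, beta=2)  # beta=2: recall counts more
b = f_beta(precision=0.85, recall=0.95, beta=2)

print(a, b)  # B wins under a recall-weighted comparison
```

With beta = 1 this reduces to the ordinary F1 score; picking beta is where the business decision enters the math.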
A good model version shows improved or stable metrics compared to the previous one. For example, if accuracy rises from 85% to 90%, and precision and recall also improve or stay balanced, that is good. A bad version might show lower accuracy or a sharp drop in recall or precision, meaning it misses more true cases or makes more false predictions.
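In a versioning workflow this comparison is often automated as a promotion gate: the new version ships only if no tracked metric regresses beyond a tolerance. A sketch, where the metric names and the 0.02 tolerance are assumptions for illustration:

```python
def passes_gate(old, new, tolerance=0.02):
    """Return (ok, regressions): the new version must not drop any
    metric by more than `tolerance` below the old version's value."""
    regressions = {m: (old[m], new[m]) for m in old
                   if new[m] < old[m] - tolerance}
    return len(regressions) == 0, regressions

old = {"accuracy": 0.85, "precision": 0.84, "recall": 0.80}
new = {"accuracy": 0.90, "precision": 0.82, "recall": 0.90}

ok, regressions = passes_gate(old, new)
print(ok, regressions)  # precision dipped by 0.02, which is within tolerance
```

A gate like this turns the vague question "is the new version good?" into an explicit, reviewable policy.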
- Accuracy Paradox: A model might have high accuracy but poor recall on important classes, misleading version comparison.
- Data Leakage: If test data leaks into training, metrics look better but don't reflect real performance.
- Overfitting: A new version might perform well on training data but worse on new data, so metrics can be misleading without proper validation.
- Ignoring Business Context: Metrics alone don't tell the full story; consider what errors cost most in your application.
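To illustrate the data-leakage pitfall concretely: any preprocessing statistic (for example a feature's mean and standard deviation for normalization) must be computed on the training split only, after the split is made. A minimal sketch in plain Python with synthetic data:

```python
import random

def train_test_split(data, test_frac=0.25, seed=0):
    """Shuffle and split BEFORE computing any statistics, so nothing
    about the test set leaks into training."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    cut = int(len(data) * (1 - test_frac))
    return [data[i] for i in idx[:cut]], [data[i] for i in idx[cut:]]

data = [float(x) for x in range(100)]  # synthetic feature values
train, test = train_test_split(data)

# Correct: normalization statistics come from the training split only.
mean = sum(train) / len(train)
std = (sum((x - mean) ** 2 for x in train) / len(train)) ** 0.5

train_scaled = [(x - mean) / std for x in train]
test_scaled = [(x - mean) / std for x in test]  # reuse train stats; never refit
```

Computing `mean` and `std` over the full dataset before splitting is the classic leak: test information would silently shape the model's inputs, inflating every metric you compare versions with.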
Your new model version has 98% accuracy but only 12% recall on fraud cases. Is it good for production? Why or why not?
Answer: No. Although accuracy is high, the model misses 88% of fraud cases (12% recall). In fraud detection, catching fraud (high recall) is critical, so this version would fail in real use.
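To see why with concrete numbers: the counts below are hypothetical, chosen so that 10,000 transactions with a 2% fraud rate reproduce exactly this metric profile.

```python
# Hypothetical counts matching 98% accuracy and 12% recall on fraud.
tp, fn = 24, 176     # 200 actual fraud cases; only 24 caught
fp, tn = 24, 9776    # 9,800 legitimate transactions

accuracy = (tp + tn) / (tp + fn + fp + tn)
recall = tp / (tp + fn)

print(accuracy)  # 0.98 -> looks great on a dashboard
print(recall)    # 0.12 -> 176 of 200 fraud cases slip through
```

Because fraud is rare, the 9,776 correctly passed legitimate transactions dominate accuracy, hiding the 176 missed frauds that matter most.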