Model Comparison in Computer Vision - Model Metrics & Evaluation

When comparing models, we want to know which one works best for our task. In computer vision, common classification metrics include accuracy, precision, recall, and F1 score. These metrics describe how well the model classifies images, which matters especially when classes are imbalanced. For example, if we want to detect rare objects, recall is important so we catch as many as possible; if we want to avoid false alarms, precision matters more. Using these metrics helps us pick the model that fits our real-world needs.
                Predicted
              | Cat | Dog |
         -----+-----+-----+
True      Cat |  50 |  10 |
          Dog |   5 |  35 |
Reading the matrix for the cat class: 50 cats correctly predicted as cats (TP), 10 cats wrongly predicted as dogs (FN), 5 dogs wrongly predicted as cats (FP), and 35 dogs correctly predicted as dogs (TN).
This matrix helps calculate precision, recall, and accuracy for each class.
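These per-class calculations can be written out directly from the four counts in the matrix above:

```python
# Metrics for the "cat" class, using the counts from the confusion matrix:
# TP=50, FN=10, FP=5, TN=35.
tp, fn, fp, tn = 50, 10, 5, 35

accuracy = (tp + tn) / (tp + fn + fp + tn)  # all correct / all samples
precision = tp / (tp + fp)                  # of predicted cats, how many are cats
recall = tp / (tp + fn)                     # of true cats, how many were found

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
# accuracy=0.85, precision=0.91, recall=0.83
```

Note that precision and recall are computed per class; repeating the same arithmetic with the dog class as "positive" gives that class's own scores.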
Imagine a model that detects defective products in a factory.
- High precision: few good products are wrongly flagged as defective, so little good product is wasted, but some defects may slip through.
- High recall: most defective products are caught, but some good products may be wrongly rejected.
Choosing between precision and recall depends on what is more costly: missing defects or wasting good products.
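In practice this trade-off is often controlled by the decision threshold on the model's output score. The sketch below uses made-up defect scores and labels (not from a real model) to show how a strict threshold favors precision while a lenient one favors recall:

```python
# Hypothetical per-product "defect scores" from a detector, with true labels
# (1 = defective). These values are invented for illustration.
scores = [0.95, 0.90, 0.80, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1,    1,    1,    0,    1,    0,    0,    1,    0,    0]

def precision_recall(threshold):
    predicted = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum((not p) and l for p, l in zip(predicted, labels))
    return tp / (tp + fp), tp / (tp + fn)

print(precision_recall(0.75))  # strict: (1.0, 0.6)  -> no false alarms, defects missed
print(precision_recall(0.15))  # lenient: (0.625, 1.0) -> all defects caught, more false alarms
```

Which threshold is right depends on exactly the cost question above: missing defects versus rejecting good products.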
Good: high accuracy (e.g., 95%+) together with balanced precision and recall (both above 90%) and a high F1 score shows the model predicts well and is reliable.
Bad: high accuracy but low recall (e.g., 98% accuracy with only 20% recall) means the model misses many important cases; very low precision means many false alarms.
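F1 makes the "bad" pattern visible because it is the harmonic mean of precision and recall, which drops sharply when either one is low:

```python
# F1 = harmonic mean of precision and recall. It rewards balance and
# punishes a large gap between the two, unlike a simple average.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.92, 0.91), 2))  # balanced and high -> 0.91
print(round(f1(0.98, 0.20), 2))  # high precision, low recall -> 0.33
```

The second case mirrors the "98% accuracy, 20% recall" example: the low F1 flags a problem that accuracy alone hides.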
- Accuracy Paradox: High accuracy can be misleading if classes are imbalanced.
- Data Leakage: Using test data in training inflates metrics falsely.
- Overfitting: Model performs well on training but poorly on new data.
- Ignoring Context: Choosing metrics without considering what matters in the real task.
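The data-leakage pitfall above is usually avoided by splitting the data before any fitting or preprocessing. A minimal sketch, using a hypothetical stand-in dataset:

```python
# Split BEFORE any preprocessing; statistics (means, scalers, vocabularies)
# must be computed on the training split only. `data` here is a hypothetical
# stand-in for 100 samples.
import random

data = list(range(100))
random.seed(0)
random.shuffle(data)

split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# Computing this on the full dataset instead of `train` would leak
# information about the test set into training and inflate metrics.
train_mean = sum(train) / len(train)

assert not set(train) & set(test)  # the splits must not overlap
```

Held-out evaluation on the untouched `test` split also exposes overfitting: a model that only memorized `train` will score poorly there.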
No - this model is not good for tasks where finding positive cases matters. A recall of 12% means it misses 88% of the positive cases, which can be very harmful in, for example, disease detection or fraud detection. The high accuracy is misleading because most of the data is probably negative cases.
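A toy calculation shows how 98% accuracy and 12% recall can coexist on imbalanced data. The numbers below are invented to mirror the scenario, not real measurements:

```python
# 1000 samples, only 25 of them positive (imbalanced data, invented numbers).
total, positives = 1000, 25
negatives = total - positives       # 975

tp = 3                              # the model catches only 3 of 25 positives
fn = positives - tp                 # 22 positives missed
tn = negatives                      # assume every negative is classified correctly
fp = 0

accuracy = (tp + tn) / total
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.3f}, recall={recall:.2f}")
# accuracy=0.978, recall=0.12
```

Almost all of the accuracy comes from the 975 easy negatives; the model has barely learned to find positives at all.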