When using one-vs-rest (OvR) or one-vs-one (OvO) strategies for multi-class classification, metrics such as accuracy, precision, recall, and F1-score still matter: these strategies break a multi-class problem into multiple binary problems, so we need to measure how well each binary classifier performs and then combine the results. Macro-averaged precision, recall, and F1-score weight every class equally, which makes them especially useful when classes are imbalanced.
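As a minimal sketch (with made-up labels for a three-class problem), macro averaging in scikit-learn looks like this:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical ground truth and predictions for a 3-class problem.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# average="macro" computes each metric per class and then takes the
# unweighted mean, so every class counts equally regardless of its size.
p = precision_score(y_true, y_pred, average="macro")
r = recall_score(y_true, y_pred, average="macro")
f = f1_score(y_true, y_pred, average="macro")
print(round(p, 3), round(r, 3), round(f, 3))
```

With `average="weighted"` instead, large classes would dominate the mean, which is exactly what macro averaging is meant to avoid.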
For OvR, each class has its own binary confusion matrix. For example, with 3 classes (A, B, C), the OvR confusion matrix for class A looks like:
             | Predicted A | Predicted Not A
-------------|-------------|----------------
Actual A     | TP          | FN
Actual Not A | FP          | TN
For OvO, each pair of classes has a binary confusion matrix. For classes A and B:
         | Predicted A | Predicted B
---------|-------------|------------
Actual A | TP          | FN
Actual B | FP          | TN
These binary results are then combined to produce the final multi-class prediction: OvR picks the class whose binary classifier gives the highest score, while OvO lets the pairwise classifiers vote.
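scikit-learn can produce the per-class OvR matrices directly. A sketch with made-up labels (note that sklearn's layout is [[TN, FP], [FN, TP]], not the layout drawn above):

```python
from sklearn.metrics import multilabel_confusion_matrix

# Hypothetical labels: classes A=0, B=1, C=2.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# One binary confusion matrix per class; sklearn's layout is
# [[TN, FP],
#  [FN, TP]]
per_class = multilabel_confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
tn, fp, fn, tp = per_class[0].ravel()  # class A's one-vs-rest matrix
print(tp, fn, fp, tn)
```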
In OvR or OvO, each binary classifier faces a tradeoff between precision and recall:
- Precision: How many predicted positives are actually correct? Important if false alarms are costly.
- Recall: How many actual positives are found? Important if missing a class is costly.
Example: For a disease-detection task with multiple diseases (classes), using OvR:
- If you want to avoid wrongly labeling healthy people as sick (false positives), focus on high precision.
- If you want to catch all sick people (true positives), focus on high recall.
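To make the tradeoff concrete, here is a sketch with hypothetical scores from the "disease A vs. rest" binary classifier: lowering the decision threshold raises recall but lowers precision.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical predicted probabilities from the "disease A vs. rest" classifier.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
proba = np.array([0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05])

results = {}
for threshold in (0.5, 0.3):
    y_pred = (proba >= threshold).astype(int)
    results[threshold] = (precision_score(y_true, y_pred),
                          recall_score(y_true, y_pred))

# Lowering the threshold from 0.5 to 0.3 catches every sick patient
# (recall rises to 1.0) at the cost of more false alarms (precision drops).
print(results)
```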
Choosing OvR or OvO affects how these tradeoffs appear: because OvO compares pairs of classes directly, it can improve precision between similar classes, but it requires training K(K-1)/2 classifiers for K classes instead of K, increasing complexity.
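The complexity difference is easy to see with scikit-learn's wrappers: OvR fits one binary classifier per class, while OvO fits one per pair of classes. A sketch on the 10-class digits dataset:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_digits(return_X_y=True)  # 10 classes

ovr = OneVsRestClassifier(LogisticRegression(max_iter=2000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=2000)).fit(X, y)

# OvR trains K binary classifiers; OvO trains K*(K-1)/2 pairwise ones.
print(len(ovr.estimators_), len(ovo.estimators_))  # 10 and 45
```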
Good metrics for OvR/OvO multi-class classification (rough rules of thumb; acceptable values depend on the application):
- Accuracy: High (close to 1.0) means most samples are correctly classified.
- Macro F1-score: High (above 0.8) suggests balanced performance across all classes.
- Precision and Recall: Both should be reasonably high (above 0.7) for each class to avoid bias toward particular classes.
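In practice, scikit-learn's `classification_report` gives all of these per-class numbers plus the macro averages in one call (labels here are made up):

```python
from sklearn.metrics import classification_report

# Hypothetical predictions for a 3-class problem.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# One row per class with precision, recall, and F1,
# followed by macro and weighted averages.
report = classification_report(y_true, y_pred,
                               target_names=["A", "B", "C"])
print(report)
```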
Bad metrics:
- High accuracy but low recall on some classes means the model misses many samples of those classes.
- High precision but low recall means the model is too strict and misses positives.
- Very low F1-score (below 0.5) indicates poor balance and unreliable classification.
Common pitfalls:
- Accuracy paradox: High accuracy can be misleading if classes are imbalanced. For example, if one class dominates, always predicting that class yields high accuracy but poor real performance.
- Data leakage: If test data leaks into training, metrics look unrealistically good.
- Overfitting: Very high training metrics but low test metrics show the model memorizes training data but fails to generalize.
- Ignoring class imbalance: Not using macro-averaged metrics can hide poor performance on minority classes.
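The accuracy paradox is easy to reproduce with made-up numbers: a model that always predicts the majority class scores 95% accuracy yet has a macro F1 below 0.5.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a "model" that always predicts the majority class

acc = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning for the never-predicted minority class.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(acc, round(macro_f1, 3))  # 0.95 accuracy vs ~0.487 macro F1
```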
No, this model is not good for fraud detection. Although 98% accuracy sounds high, a recall of 12% means it catches only 12% of actual fraud cases, and missing fraud is exactly the costly error here. The model most likely predicts nearly every transaction as non-fraud, which inflates accuracy on imbalanced data while failing at its main goal. Improving recall, even at some cost in precision, is critical here.
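A sketch with made-up counts roughly matching that scenario (25 fraud cases among 1,000 transactions, only 3 caught) shows how the two numbers coexist:

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical data: 25 fraud cases (label 1) among 1,000 transactions.
y_true = [1] * 25 + [0] * 975
# The model catches only 3 of the 25 fraud cases and raises no false alarms.
y_pred = [1] * 3 + [0] * 22 + [0] * 975

acc = accuracy_score(y_true, y_pred)  # (1000 - 22) / 1000 = 0.978
rec = recall_score(y_true, y_pred)    # 3 / 25 = 0.12
print(acc, rec)
```

Accuracy stays near 98% simply because 97.5% of the data is non-fraud; recall is the metric that exposes the failure.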