When we want to see how well a model predicts categories, the confusion matrix shows the details: how many times the model was right or wrong for each class. That breakdown helps us pick the right metric, such as accuracy, precision, or recall, depending on what matters most for the problem.
Confusion matrix visualization in TensorFlow - Model Metrics & Evaluation
                Predicted
                  0      1
Actual   0  |    50  |  10  |
         1  |     5  |  35  |
Where:
- 50 = True Negative (TN)
- 10 = False Positive (FP)
- 5 = False Negative (FN)
- 35 = True Positive (TP)
Total samples = 50 + 10 + 5 + 35 = 100.
This table shows how many times the model predicted each class correctly or incorrectly.
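The counts above can be tallied with a few lines of code. A minimal sketch in plain Python (TensorFlow's tf.math.confusion_matrix does the same job on tensors); the toy labels are chosen here just to reproduce the table:

```python
def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) for binary labels 0/1."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp

# Toy data matching the table above: 50 TN, 10 FP, 5 FN, 35 TP.
y_true = [0] * 50 + [0] * 10 + [1] * 5 + [1] * 35
y_pred = [0] * 50 + [1] * 10 + [0] * 5 + [1] * 35

print(confusion_counts(y_true, y_pred))  # (50, 10, 5, 35)
```

Each (actual, predicted) pair lands in exactly one of the four cells, so the four counts always sum to the number of samples.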
Precision tells us how many of the items the model said were positive actually are positive. For example, in spam detection, high precision means few good emails are wrongly marked as spam.
Recall tells us how many of the actual positive items the model found. For example, in cancer detection, high recall means the model finds most cancer cases, even if it sometimes makes mistakes.
Improving one often lowers the other, so we choose based on what is more important: avoiding false alarms or missing real cases.
For the confusion matrix above:
- Precision = TP / (TP + FP) = 35 / (35 + 10) = 0.78 (78%)
- Recall = TP / (TP + FN) = 35 / (35 + 5) = 0.88 (88%)
- Accuracy = (TP + TN) / Total = (35 + 50) / 100 = 0.85 (85%)
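The three formulas above are one-liners once the four counts are known. A quick check of the arithmetic, plugging in the counts from the matrix:

```python
# Counts from the confusion matrix above.
tp, tn, fp, fn = 35, 50, 10, 5
total = tp + tn + fp + fn

precision = tp / (tp + fp)      # 35 / 45
recall = tp / (tp + fn)         # 35 / 40
accuracy = (tp + tn) / total    # 85 / 100

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
# precision=0.78 recall=0.88 accuracy=0.85
```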
Good: Precision and recall both above 80% mean the model reliably finds positives without raising many false alarms.
Bad: Precision or recall below 50% means the model either makes many wrong positive calls or misses many real positives.
- Accuracy paradox: High accuracy can be misleading if classes are unbalanced. For example, if 95% of data is negative, a model that always guesses negative has 95% accuracy but is useless.
- Data leakage: When the model accidentally learns from future or test data, metrics look better but the model fails in real use.
- Overfitting: Very high training metrics but poor test metrics mean the model memorizes training data and won't generalize.
Your model has 98% accuracy but only 12% recall on fraud cases. Is it good for production?
Answer: No. With 12% recall the model misses 88% of fraud cases, which is dangerous. The high accuracy mostly reflects the many legitimate transactions it labels correctly, so despite 98% accuracy it fails at its actual job of catching fraud and is not ready for production.