Prediction and Evaluation in TensorFlow - Model Metrics & Evaluation

When we make predictions with a model, we want to know how well it works. The main metrics to check are accuracy, precision, recall, and F1 score. These tell us whether the model's predictions are correct overall, whether it finds the cases that matter, and whether it avoids false alarms. The right metric depends on what matters most for the problem.
|                     | Predicted Positive  | Predicted Negative  |
|---------------------|---------------------|---------------------|
| **Actual Positive** | True Positive (TP)  | False Negative (FN) |
| **Actual Negative** | False Positive (FP) | True Negative (TN)  |
Example:
TP = 50, FP = 10, TN = 30, FN = 10
Total samples = 100
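Plugging these counts into the standard formulas gives all four metrics directly. A minimal sketch in plain Python (no libraries needed):

```python
# Metrics computed by hand from the confusion-matrix counts above.
TP, FP, TN, FN = 50, 10, 30, 10

accuracy = (TP + TN) / (TP + FP + TN + FN)          # correct / total = 0.80
precision = TP / (TP + FP)                          # ~0.833: "yes" calls that were right
recall = TP / (TP + FN)                             # ~0.833: real positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.2f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Because precision and recall happen to be equal here, F1 equals that same value; in general F1 sits between the two, pulled toward the lower one.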
Precision answers: when the model says "yes", how often is it right? Precision = TP / (TP + FP); high precision means few false alarms.
Recall answers: of the actual "yes" cases, how many does the model find? Recall = TP / (TP + FN); high recall means it misses very few real cases.
Example 1: Spam filtering favors high precision, so legitimate emails are not marked as spam.
Example 2: Cancer detection favors high recall, so no real cancer cases are missed.
Rough rules of thumb (these thresholds are problem-dependent, not universal): accuracy above 90%, precision and recall both above 85%, and an F1 score close to 1 usually indicate a strong model.
Warning signs: accuracy around 50% on a balanced binary task is no better than random guessing, and precision or recall below 50% or a very low F1 score signals a weak model.
Note: High accuracy alone can be misleading if classes are imbalanced.
- Accuracy paradox: High accuracy can hide poor performance on rare classes.
- Data leakage: Using future or test data in training inflates metrics falsely.
- Overfitting: Very high training accuracy with much lower test accuracy means the model memorizes the training data instead of learning general patterns.
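The accuracy paradox is easy to see with a small sketch. The dataset below is made up for illustration: 2% of samples are positive, and the "model" is degenerate, always predicting the majority class. Its accuracy looks excellent while its recall is zero:

```python
# Accuracy paradox on an imbalanced dataset:
# a model that always predicts the majority class (0) looks accurate
# but never finds a single positive case.
labels = [1] * 2 + [0] * 98   # 2% positive class (made-up data)
preds = [0] * 100             # degenerate "always negative" model

tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
correct = sum(1 for y, p in zip(labels, preds) if y == p)

accuracy = correct / len(labels)                 # 0.98 - looks great
recall = tp / (tp + fn) if (tp + fn) else 0.0    # 0.0 - catches nothing
print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

This is why accuracy alone should never be the deciding metric on imbalanced data: report recall (and precision) for the rare class as well.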
Your model has 98% accuracy but only 12% recall on fraud cases. Is it good for production?
Answer: No. Even though accuracy is high, the model misses 88% of fraud cases (recall is only 12%). Because catching fraud is the whole point of the model, it is not ready for production; it needs improvement so it catches far more fraud cases.
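One common way to raise recall is to lower the decision threshold, accepting more false alarms in exchange for catching more real positives. A hypothetical sketch (the scores and labels below are made up for illustration; scores stand in for the model's predicted probability of fraud):

```python
# Hypothetical example: trading precision for recall by lowering the
# decision threshold on the model's predicted fraud probabilities.
scores = [0.9, 0.6, 0.4, 0.35, 0.2, 0.1]   # made-up model scores
labels = [1,   1,   1,   0,    0,   0]     # made-up ground truth

def recall_at(threshold):
    """Recall when predicting 'fraud' for every score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    return tp / (tp + fn)

print(recall_at(0.5))   # 2/3: the default 0.5 threshold misses one fraud case
print(recall_at(0.3))   # 1.0: lowering the threshold catches all three
```

The trade-off is that the lower threshold also flags one non-fraud sample (score 0.35), reducing precision; the right balance depends on the relative cost of missed fraud versus false alarms.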