model.fit() training loop in TensorFlow - Model Metrics & Evaluation

During training with model.fit(), the key metrics are loss and accuracy (for classification). Loss measures how far the model's predictions are from the true answers; accuracy is the fraction of predictions that are correct. Watching both over epochs shows whether the model is learning and improving.
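To make the difference between loss and accuracy concrete, here is a minimal pure-Python sketch (no TensorFlow required) that computes binary cross-entropy loss and accuracy for a few made-up predictions. These are the same two numbers model.fit() reports each epoch; note that two sets of predictions can have identical accuracy but different loss, because loss also rewards confidence.

```python
import math

def binary_cross_entropy(y_true, y_prob):
    """Average negative log-likelihood: lower means predictions are closer to the labels."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_prob)) / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    return sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_prob)) / len(y_true)

y_true = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]  # close to the labels -> low loss
hesitant  = [0.6, 0.4, 0.6, 0.4]  # same accuracy, but less certain -> higher loss

print(accuracy(y_true, confident))  # 1.0
print(accuracy(y_true, hesitant))   # 1.0
print(binary_cross_entropy(y_true, confident) < binary_cross_entropy(y_true, hesitant))  # True
```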
Confusion Matrix (example for binary classification):

              Predicted 0   Predicted 1
  Actual 0        50            10
  Actual 1         5            35
Here:
- True Positives (TP) = 35
- True Negatives (TN) = 50
- False Positives (FP) = 10
- False Negatives (FN) = 5
Total samples = 50 + 10 + 5 + 35 = 100
From these counts, metrics such as precision (TP / (TP + FP)) and recall (TP / (TP + FN)) can be calculated to understand model performance.
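Plugging the counts from the matrix above into those formulas:

```python
# Counts taken from the confusion matrix above
TP, TN, FP, FN = 35, 50, 10, 5

precision = TP / (TP + FP)                  # of everything predicted positive, how much was right
recall    = TP / (TP + FN)                  # of all actual positives, how many were found
acc       = (TP + TN) / (TP + TN + FP + FN) # overall fraction correct

print(round(precision, 3))  # 0.778
print(round(recall, 3))     # 0.875
print(acc)                  # 0.85
```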
When training a model, improving one metric can lower another; this is the precision-recall tradeoff. If the model tries to catch every positive case (high recall), it also flags more negatives by mistake (lower precision). If it predicts positive only when very sure (high precision), it misses some true positives (lower recall). Knowing which kind of error is costlier for your problem tells you which metric to prioritize.
Example: In training a spam filter with model.fit(), you might want high precision to avoid marking good emails as spam. But in training a cancer detector, high recall is more important to catch all possible cases.
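The tradeoff is easiest to see by sweeping the decision threshold over a model's output scores. The scores and labels below are hypothetical, chosen only to illustrate the pattern: a low threshold catches everything (high recall, low precision), a high threshold is cautious (high precision, low recall).

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]  # hypothetical model outputs

for t in (0.2, 0.5, 0.8):
    p, r = precision_recall(y_true, scores, t)
    print(t, round(p, 2), round(r, 2))
# 0.2 -> precision 0.60, recall 1.00  (catch everything, more false alarms)
# 0.5 -> precision 0.67, recall 0.67
# 0.8 -> precision 1.00, recall 0.33  (very sure, but misses positives)
```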
Good training metrics:
- Loss steadily decreases over epochs.
- Accuracy steadily increases and stabilizes at a high value for the task (what counts as "high" varies; above 80% is a common rough benchmark for many tasks).
- Precision and recall improve together or balance well for the problem.
Bad training metrics:
- Loss stays high or fluctuates wildly.
- Accuracy remains low or does not improve.
- Precision or recall is very low, indicating poor prediction quality.
- Accuracy paradox: High accuracy can be misleading if classes are imbalanced (e.g., 95% accuracy by always predicting the majority class).
- Data leakage: If training data leaks into validation, metrics look better but model won't work well on new data.
- Overfitting indicators: Training loss decreases but validation loss increases, showing the model memorizes training data but fails to generalize.
- Ignoring validation metrics: Only looking at training metrics can hide poor real-world performance.
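One way to spot the overfitting pattern described above is to compare training and validation loss per epoch, e.g. from the dict that Keras's model.fit() returns in history.history. The curves below are made-up numbers for illustration; the helper flags the first epoch where validation loss turns upward while training loss keeps falling:

```python
def first_overfit_epoch(train_loss, val_loss):
    """Return the first epoch (0-based) where val loss rises while train loss still falls."""
    for e in range(1, len(val_loss)):
        if val_loss[e] > val_loss[e - 1] and train_loss[e] < train_loss[e - 1]:
            return e
    return None

# hypothetical curves: training keeps improving, validation turns around at epoch 3
train_loss = [0.90, 0.60, 0.40, 0.30, 0.22, 0.15]
val_loss   = [0.95, 0.70, 0.55, 0.60, 0.68, 0.75]

print(first_overfit_epoch(train_loss, val_loss))  # 3
```

This is essentially what Keras's EarlyStopping callback automates when it monitors val_loss.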
Your model.fit() training shows 98% accuracy but only 12% recall on fraud cases. Is this good for production? Why or why not?
Answer: No. Fraud cases are rare, so the model can reach 98% accuracy largely by predicting "not fraud" (the accuracy paradox). With 12% recall it misses 88% of actual fraud, which defeats the purpose of a fraud detector. In production, recall on the fraud class is the metric that matters here.
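Hypothetical counts (a 1% fraud rate, assumed purely for illustration) show how 98% accuracy and 12% recall can coexist:

```python
# 10,000 transactions, 100 of them fraud (1% positive class, assumed for illustration)
TP, FN = 12, 88      # only 12 of 100 frauds caught -> recall = 12%
TN, FP = 9788, 112   # the huge majority class dominates accuracy

acc = (TP + TN) / (TP + TN + FP + FN)
recall = TP / (TP + FN)

print(acc)     # 0.98
print(recall)  # 0.12
```

Always predicting "not fraud" would score 99% accuracy on this data, which is why accuracy alone says almost nothing about a model on imbalanced classes.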