Saving weights only in TensorFlow - Model Metrics & Evaluation
When saving only a model's weights, the key check is to compare the model's performance metrics before and after loading the weights. This confirms the saved weights correctly capture what the model has learned. Common metrics include loss and accuracy on validation data. If these metrics stay consistent, the weights were saved and loaded properly.
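The consistency check described above can be sketched in plain Python. This is a minimal sketch, not TensorFlow API code: the helper name `weights_load_ok` and the tolerance value are assumptions for illustration; in practice the two metric dicts would come from calling `model.evaluate` before saving and after loading.

```python
# Sketch: verify that metrics measured after loading weights match the
# metrics measured before saving, within a small tolerance.
def weights_load_ok(before, after, tol=1e-6):
    """before/after: dicts like {"loss": 0.31, "accuracy": 0.91}."""
    return all(abs(before[k] - after[k]) <= tol for k in before)

# Example: metrics agree, so the save/load round trip looks correct.
before = {"loss": 0.3123, "accuracy": 0.9140}
after  = {"loss": 0.3123, "accuracy": 0.9140}
print(weights_load_ok(before, after))  # True
```

A small tolerance is used rather than exact equality because metrics can differ by floating-point noise across runs or hardware.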
For classification models, a confusion matrix helps check if the loaded weights produce the same predictions as before saving. For example, a binary classifier confusion matrix:
                Predicted
               1      0
Actual  1 |   TP  |  FN  |
        0 |   FP  |  TN  |
After loading the weights, the TP, FP, TN, and FN counts should be identical to those before saving, given the same evaluation data and deterministic inference.
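This count comparison can be sketched as follows. The helper name `confusion_counts` and the toy label lists are assumptions for illustration; in practice the two prediction lists would come from running the model on the same validation data before saving and after loading.

```python
# Sketch: recompute TP/FP/TN/FN from predictions made before and after a
# save/load round trip; on the same data, the counts should match exactly.
def confusion_counts(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}

y_true      = [1, 1, 0, 0, 1, 0]
pred_before = [1, 0, 0, 1, 1, 0]   # predictions before saving weights
pred_after  = [1, 0, 0, 1, 1, 0]   # predictions after loading weights
print(confusion_counts(y_true, pred_before) == confusion_counts(y_true, pred_after))  # True
```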
Saving weights only does not directly affect precision or recall, but if weights are corrupted or mismatched, model predictions can degrade, causing precision and recall to drop. For example:
- If weights are saved and loaded correctly, precision and recall remain stable.
- If weights are partially saved or loaded incorrectly, recall might drop (more positive cases missed) or precision might drop (more false alarms).
Thus, verifying metrics after loading weights is crucial.
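The precision and recall formulas behind this check can be sketched directly from the confusion-matrix counts. The counts in the example are made-up numbers for illustration only:

```python
# Sketch: precision and recall computed from confusion-matrix counts,
# to compare values measured before saving weights and after loading them.
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

# Intact weights: counts unchanged, so precision and recall are stable.
print(precision(tp=80, fp=20), recall(tp=80, fn=20))   # 0.8 0.8
# Corrupted weights: more misses (FN up, TP down), so recall drops.
print(precision(tp=50, fp=20), recall(tp=50, fn=50))   # ~0.714 0.5
```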
Good: After loading weights, validation loss and accuracy remain close to values before saving. Confusion matrix counts are stable. Precision and recall do not drop significantly.
Bad: After loading weights, validation loss increases sharply, accuracy drops, or confusion matrix shows many more errors. This means weights were not saved or loaded properly.
- Accuracy paradox: High accuracy after loading weights might be misleading if the dataset is imbalanced.
- Data leakage: If validation data leaks into training, metrics before saving weights may be unrealistically high.
- Overfitting: If weights are saved from an overfitted model, metrics may look good on training but poor on new data.
- Mismatch in model architecture: Loading weights into a different model structure causes errors or poor metrics.
No, this is not acceptable for fraud detection. The model misses 88% of fraud cases (a recall of only 12%), which is dangerous in this domain. Saving and loading weights correctly matters, but the model must also be trained to detect fraud well in the first place. High accuracy alone can be misleading when the data is imbalanced.
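The accuracy paradox here can be shown with a quick calculation. The dataset size (1,000 transactions, 50 fraudulent) is an assumed illustration consistent with the 88% miss rate above:

```python
# Sketch of the accuracy paradox on imbalanced fraud data: 1,000
# transactions, 50 fraudulent. The model catches only 6 of 50 frauds
# (misses 88%), yet overall accuracy still looks high.
tp, fn = 6, 44          # fraud cases caught / missed
tn, fp = 940, 10        # legitimate cases correct / wrongly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall   = tp / (tp + fn)
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")  # accuracy=0.95, recall=0.12
```

An accuracy near 95% hides the fact that almost all fraud slips through, which is why recall must be checked separately on imbalanced data.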