Checkpointing is the practice of saving the model's state periodically during training. The key metric to watch is training loss or validation loss over time, which shows whether the model is improving. Checkpointing preserves that progress: if training is interrupted, you can resume from a saved state instead of starting over.
Why checkpointing preserves progress in PyTorch - Why Metrics Matter
Which metric matters for this concept and WHY
Confusion matrix or equivalent visualization (ASCII)
Checkpointing does not directly involve confusion matrices.
Instead, think of it as saving snapshots of training:
Step: 1    2    3    4    5    6    7     8    9    10
Loss: 0.9  0.8  0.7  0.6  0.5  0.4  0.35  0.3  0.28 0.25

Checkpoint saved at step 5 (loss 0.5).
If training stops at step 7, you can reload the checkpoint from step 5
and continue training from there, not from step 1.
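The snapshot idea above can be sketched in PyTorch. This is a minimal sketch, not production code: the model, optimizer, step numbers, and file path are placeholder assumptions.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Tiny placeholder model and optimizer (assumed for illustration).
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

path = os.path.join(tempfile.mkdtemp(), "checkpoint_step5.pt")

# Save a checkpoint at step 5: both the model AND the optimizer state.
torch.save(
    {
        "step": 5,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
        "loss": 0.5,
    },
    path,
)

# ... training crashes at step 7 ...

# Resume: reload the step-5 snapshot and continue from there.
ckpt = torch.load(path)
model.load_state_dict(ckpt["model_state"])
optimizer.load_state_dict(ckpt["optimizer_state"])
start_step = ckpt["step"] + 1  # continue from step 6, not from step 1
```

Saving the optimizer state alongside the model weights is what lets momentum buffers and adaptive learning-rate statistics survive the restart.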
Precision vs Recall (or equivalent tradeoff) with concrete examples
Checkpointing trades time saved against storage used. Saving checkpoints more often means more storage but less lost work if training is interrupted; saving less often conserves space but risks losing more progress.
Example: If you save checkpoints every 10 minutes, you lose at most 10 minutes of work on failure. If you save every hour, you risk losing up to an hour of training.
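The tradeoff can be put in rough numbers. A sketch with made-up values for the run length and checkpoint size (both are assumptions, not from the source):

```python
# Worst-case lost work on failure equals the checkpoint interval;
# storage grows as the interval shrinks (if every checkpoint is kept).
run_minutes = 600       # a 10-hour run (assumed)
ckpt_size_gb = 2.0      # size of one checkpoint file (assumed)

for interval in (10, 60):               # every 10 min vs every 60 min
    n_checkpoints = run_minutes // interval
    storage_gb = n_checkpoints * ckpt_size_gb
    print(f"every {interval:2d} min: worst-case loss {interval} min, "
          f"{n_checkpoints} checkpoints, {storage_gb:.0f} GB if all kept")
```

A common compromise is to keep only the last few checkpoints plus the best-validation one, capping storage while retaining short worst-case loss.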
What "good" vs "bad" metric values look like for this use case
Good checkpointing means:
- Checkpoints saved frequently enough to avoid losing much progress.
- Checkpoints correctly restore model and optimizer states.
- Training loss continues to decrease after resuming from checkpoint.
Bad checkpointing means:
- Checkpoints saved too rarely, causing large loss of training time on failure.
- Checkpoints missing optimizer state, causing momentum and adaptive learning-rate statistics to reset on resume.
- Loss jumps or training stalls after resuming, indicating corrupted or incomplete checkpoint.
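One concrete way to test the "correctly restore" point above: a restored model should reproduce the pre-save loss exactly on the same batch. A minimal sketch; the model, seed, and data are placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)
loss_fn = nn.MSELoss()

loss_before = loss_fn(model(x), y).item()

# Round-trip the weights through a state dict, as a checkpoint would.
state = model.state_dict()
restored = nn.Linear(4, 1)
restored.load_state_dict(state)
loss_after = loss_fn(restored(x), y).item()

# A correct restore reproduces the pre-save loss on the same batch.
assert loss_after == loss_before
```

If the loss jumps after a real resume, suspect a missing optimizer state, a mismatched model definition, or a corrupted checkpoint file.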
Metrics pitfalls (accuracy paradox, data leakage, overfitting indicators)
Common pitfalls with checkpointing include:
- Not saving optimizer state: momentum buffers and adaptive learning-rate statistics (e.g. Adam moments) reset on resume, hurting training.
- Overwriting checkpoints without backups: losing all progress if checkpoint is corrupted.
- Confusing checkpoint saving with model evaluation metrics: checkpointing only saves state, it does not improve metrics by itself.
- Not verifying checkpoint integrity before resuming: can cause silent errors.
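Two of these pitfalls, overwriting without backups and skipping integrity checks, can be addressed with a write-then-rename pattern. A sketch, assuming a placeholder model and path:

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 1)
ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")

# Write to a temporary file first; an interrupted save can then never
# corrupt the previous checkpoint at ckpt_path.
tmp_path = ckpt_path + ".tmp"
torch.save({"model_state": model.state_dict()}, tmp_path)

# Verify integrity by reloading BEFORE replacing the old checkpoint
# (torch.load raises if the file is truncated or corrupted).
torch.load(tmp_path)

os.replace(tmp_path, ckpt_path)  # atomic rename on POSIX filesystems
```

Keeping a small rotation of recent checkpoints (rather than a single file) adds a further safety margin against silent corruption.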
Your model has 98% accuracy but 12% recall on fraud. Is it good?
No, it is not good for fraud detection. The low recall (12%) means the model misses most fraud cases, which is dangerous. Checkpointing helps preserve training progress but does not fix poor model performance. You need to improve the model or data, not just rely on checkpointing.
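The accuracy/recall mismatch is easy to reproduce with hypothetical confusion-matrix counts (the numbers below are illustrative assumptions chosen to hit 98% accuracy and 12% recall, not data from the source):

```python
# Hypothetical test set: 10,000 transactions, 100 of them fraud.
tp, fn = 12, 88        # frauds caught vs frauds missed
fp, tn = 112, 9788     # false alarms vs correct non-fraud calls

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.0%}, recall={recall:.0%}")
# High accuracy, yet 88 of the 100 frauds slip through:
# the overwhelming non-fraud majority dominates the accuracy number.
```

This is the accuracy paradox from the pitfalls above: on imbalanced data, accuracy mostly measures how common the majority class is.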
Key Result
Checkpointing preserves training progress by saving model and optimizer states, allowing training to resume without losing improvements.