What if you could never lose hours of training work, even if your computer crashes?
Why Checkpoint with Optimizer State in PyTorch? - Purpose & Use Cases
Imagine training a deep learning model for hours on your computer. Suddenly, the power goes out or your program crashes. Without saving your progress, you must start all over from the beginning.
Manually restarting training wastes time and compute. You lose all the progress your model has made so far. Worse, if you save only the model weights and not the optimizer state, adaptive optimizers such as Adam lose their accumulated momentum and variance estimates, making the resumed training slower and less stable.
Using checkpoints that save both the model and optimizer states lets you pause and resume training exactly where you left off. This means no lost progress and smoother training continuation.
Saving only the model weights looks like this:

```python
torch.save(model.state_dict(), 'model.pth')
```

Saving a full checkpoint captures both the model and the optimizer state:

```python
torch.save({'model': model.state_dict(),
            'optimizer': optimizer.state_dict()}, 'checkpoint.pth')
```

With the second form, you can safely stop and restart training anytime without losing progress or optimizer momentum.
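To complete the picture, here is a minimal round-trip sketch: save a checkpoint, then restore it into freshly created objects, as you would after a crash. The small `nn.Linear` model and the file name are illustrative assumptions, not from the original article; `torch.save`, `torch.load`, and the `load_state_dict` methods are the standard PyTorch API.

```python
import torch
import torch.nn as nn

# Illustrative model and optimizer (assumptions for this sketch).
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Save both states together in one checkpoint file.
torch.save({'model': model.state_dict(),
            'optimizer': optimizer.state_dict()}, 'checkpoint.pth')

# Later (or after a crash): rebuild the same objects, then restore.
model2 = nn.Linear(4, 2)
optimizer2 = torch.optim.Adam(model2.parameters(), lr=1e-3)
ckpt = torch.load('checkpoint.pth')
model2.load_state_dict(ckpt['model'])
optimizer2.load_state_dict(ckpt['optimizer'])
```

The key detail is that you must recreate the model and optimizer with the same architecture and settings before calling `load_state_dict`; the checkpoint stores tensors and hyperparameters, not the objects themselves.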
A data scientist training a large neural network on a shared server can save checkpoints regularly. If the server restarts or the job is paused, they resume training seamlessly without starting over.
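A scenario like this is usually handled by checkpointing periodically inside the training loop. The sketch below saves every two epochs and shows how a restarted job would pick up from the saved epoch; the toy model, random data, and checkpoint interval are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Toy setup (assumptions): a tiny model, SGD with momentum, random data.
model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.MSELoss()
x, y = torch.randn(8, 3), torch.randn(8, 1)

for epoch in range(6):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 2 == 0:  # checkpoint every 2 epochs
        torch.save({'epoch': epoch,
                    'model': model.state_dict(),
                    'optimizer': optimizer.state_dict()},
                   'checkpoint.pth')

# On restart: reload everything and continue from the next epoch.
ckpt = torch.load('checkpoint.pth')
model.load_state_dict(ckpt['model'])
optimizer.load_state_dict(ckpt['optimizer'])
start_epoch = ckpt['epoch'] + 1
```

Storing the epoch number alongside the two state dicts is what lets the resumed job continue the loop exactly where it stopped, with the SGD momentum buffers intact.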
Training can be interrupted without losing progress.
Saving the optimizer state preserves learning momentum for smoother, more stable resumption.
Checkpoints make long training jobs manageable and reliable.