How to Fix CUDA Out of Memory Error in PyTorch
A CUDA out of memory error in PyTorch means your GPU ran out of memory during model training or inference. To fix it, reduce the batch size, release cached memory with torch.cuda.empty_cache(), or move some operations to the CPU.

Why This Happens
This error occurs when the GPU cannot hold the model parameters, activations, and data needed for training or inference at the same time. Common causes are large batch sizes, large models, and memory leaks such as accumulating references to tensors that still carry their autograd history.
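A quick way to see where the memory is going is to query PyTorch's allocator counters. The helper below is a minimal sketch (the function name gpu_memory_summary is ours, not a PyTorch API); it reports zeros on a machine without a GPU.

```python
import torch

def gpu_memory_summary():
    # torch.cuda.memory_allocated(): bytes held by live tensors.
    # torch.cuda.memory_reserved(): bytes cached by PyTorch's allocator,
    # including blocks not currently backing any tensor.
    if not torch.cuda.is_available():
        return {"allocated_mb": 0.0, "reserved_mb": 0.0}
    return {
        "allocated_mb": torch.cuda.memory_allocated() / 1024**2,
        "reserved_mb": torch.cuda.memory_reserved() / 1024**2,
    }

print(gpu_memory_summary())
```

Printing this before and after large allocations shows which step pushes you toward the limit.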
For example, this training loop allocates a large batch on every iteration:

```python
import torch

def train():
    model = torch.nn.Linear(1000, 1000).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(100):
        inputs = torch.randn(512, 1000).cuda()  # Large batch size
        outputs = model(inputs)
        loss = outputs.sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

train()
```
The Fix
Reduce the batch size so less GPU memory is needed at once. You can also call torch.cuda.empty_cache() to release cached, unused blocks back to the GPU. Note that this does not free tensors your code still references, and calling it on every iteration adds overhead, so treat it as a cleanup tool rather than a fix on its own.
```python
import torch

def train_fixed():
    model = torch.nn.Linear(1000, 1000).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(100):
        inputs = torch.randn(128, 1000).cuda()  # Smaller batch size
        outputs = model(inputs)
        loss = outputs.sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        torch.cuda.empty_cache()  # Release cached blocks (optional; adds overhead)

train_fixed()
```
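If the error appears at inference time rather than during training, wrapping the forward pass in torch.no_grad() stops PyTorch from building the autograd graph and storing activations, which can cut memory use substantially. A minimal sketch (the model and sizes are illustrative; it falls back to CPU when no GPU is present):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1000, 1000).to(device)
model.eval()  # disable dropout / batch-norm updates

inputs = torch.randn(64, 1000, device=device)
with torch.no_grad():  # no autograd graph, so activations are not retained
    outputs = model(inputs)
```

Outputs produced under no_grad() do not require gradients, so intermediate buffers are freed as soon as the forward pass finishes.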
Prevention
To avoid this error in the future, monitor GPU memory during training (for example with torch.cuda.memory_allocated() or nvidia-smi). Use smaller batch sizes or gradient accumulation, free references to tensors you no longer need, and call optimizer.zero_grad() after each step so gradients do not accumulate. Mixed precision training (torch.cuda.amp) can also substantially reduce memory use.
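Gradient accumulation keeps the effective batch size while only holding a small micro-batch in memory at a time: run several forward/backward passes, then step the optimizer once. A minimal sketch (the sizes and accumulation factor are illustrative; it runs on CPU if no GPU is available):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1000, 1000).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

accum_steps = 4   # 4 micro-batches of 32 behave like one batch of 128
micro_batch = 32

optimizer.zero_grad()
for step in range(accum_steps):
    inputs = torch.randn(micro_batch, 1000, device=device)
    loss = model(inputs).sum()
    (loss / accum_steps).backward()  # scale so accumulated gradients average out
optimizer.step()       # one update for the whole effective batch
optimizer.zero_grad()
```

Only one micro-batch of activations is alive at any moment, so peak memory scales with the micro-batch size rather than the effective batch size.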
Related Errors
Other common GPU errors include RuntimeError: CUDA error: device-side assert triggered, usually caused by invalid tensor indices or shapes (for example, a class label outside the range expected by the loss function), and cuDNN error: CUDNN_STATUS_ALLOC_FAILED, which is also a memory allocation failure. Fixes typically involve validating tensor shapes and indices and reducing memory load.
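The device-side assert in particular is often triggered by an out-of-range class index reaching a loss such as nn.CrossEntropyLoss. Because the assert fires asynchronously on the GPU, the Python traceback can point at an unrelated line, so validating labels on the CPU first makes the real problem visible. A minimal sketch (the helper name check_labels is ours):

```python
import torch

def check_labels(labels, num_classes):
    # Out-of-range class indices crash CrossEntropyLoss on the GPU with
    # "CUDA error: device-side assert triggered"; catch them on CPU first.
    bad = (labels < 0) | (labels >= num_classes)
    if bad.any():
        raise ValueError(
            f"out-of-range labels at indices {bad.nonzero().flatten().tolist()}"
        )

labels = torch.tensor([3, 7, 12, 1])  # 12 is invalid for 10 classes
try:
    check_labels(labels, num_classes=10)
except ValueError as e:
    print(e)
```

Running this check once per batch costs little and turns a cryptic GPU crash into a clear error message.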