
How to Fix NaN Loss in PyTorch: Causes and Solutions

NaN loss in PyTorch usually comes from unstable training (such as exploding gradients) or invalid math such as division by zero or log of zero. To fix it, check your data and model outputs for invalid values with torch.isnan() and torch.isinf(), clip gradients, and make sure the inputs to your loss function are valid.
🔍

Why This Happens

NaN loss occurs when the model's calculations produce undefined or infinite values. This can happen if your input data contains NaNs or Infs, if your model outputs extreme values causing overflow, or if your loss function receives invalid inputs like zero division or log of zero.

For example, using a log function on zero or negative numbers can cause NaN loss.

python
import torch
import torch.nn as nn

# Example causing NaN loss due to log(0)
inputs = torch.tensor([[0.0, 0.0], [0.0, 0.0]], requires_grad=True)
targets = torch.tensor([1, 0])

loss_fn = nn.NLLLoss()

# Log of zero gives -inf, which turns into an inf loss and NaN gradients
log_probs = torch.log(inputs)  # log(0) = -inf
loss = loss_fn(log_probs, targets)
loss.backward()

print(f"Loss: {loss.item()}")
print(f"NaN in gradients: {torch.isnan(inputs.grad).any().item()}")
Output
Loss: inf
NaN in gradients: True
🔧

The Fix

Fix NaN loss by ensuring inputs to functions like torch.log are positive and non-zero. Add a small value (epsilon) to inputs before log to avoid log(0). Also, check your data for NaNs/Infs and clip gradients to prevent exploding values.

python
import torch
import torch.nn as nn

# Fix by adding a small epsilon before taking the log
eps = 1e-8
inputs = torch.tensor([[0.0, 0.0], [0.0, 0.0]], requires_grad=True)
targets = torch.tensor([1, 0])

loss_fn = nn.NLLLoss()

log_probs = torch.log(inputs + eps)  # log(eps) is finite, so no -inf
loss = loss_fn(log_probs, targets)
loss.backward()

print(f"Loss value: {loss.item():.6f}")
Output
Loss value: 18.420681
🛡️

Prevention

To avoid NaN loss in the future, always preprocess your data to remove or replace NaNs and Infs. Use torch.isnan() and torch.isinf() to check tensors. Apply gradient clipping with torch.nn.utils.clip_grad_norm_ to keep gradients stable. Also, monitor your loss values during training and stop if NaNs appear.
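Putting those checks together, a minimal training-loop sketch might look like this. The model, data, and hyperparameters here are placeholders for illustration only:

python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder model and data for illustration
model = nn.Linear(4, 2)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(8, 4)
targets = torch.randint(0, 2, (8,))

# Check the data before training
assert not torch.isnan(inputs).any(), "NaN in input data"
assert not torch.isinf(inputs).any(), "Inf in input data"

for step in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)

    # Monitor the loss and stop early if it goes NaN
    if torch.isnan(loss):
        print(f"NaN loss at step {step}, stopping")
        break

    loss.backward()
    # Clip gradients to keep updates stable
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

The same pattern works with any model: validate the data once up front, then guard every step with the NaN check and gradient clipping.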

⚠️

Related Errors

Other common errors related to NaN loss include:

  • Inf gradients: Caused by very large values; fix with gradient clipping.
  • Division by zero: Happens in custom loss functions; add small epsilon to denominators.
  • Invalid label indices: Using wrong target labels in classification losses causes runtime errors.
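For the division-by-zero case, the epsilon trick works the same way as it does for log. Here is a sketch with a hypothetical custom loss (relative_error_loss is not a PyTorch built-in):

python
import torch

def relative_error_loss(pred, target, eps=1e-8):
    # Add eps to the denominator so zero targets don't divide by zero
    return ((pred - target).abs() / (target.abs() + eps)).mean()

pred = torch.tensor([1.0, 2.0, 3.0])
target = torch.tensor([0.0, 2.0, 3.0])  # first target is zero

loss = relative_error_loss(pred, target)
print(torch.isnan(loss).item())  # False: eps keeps the division finite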

Key Takeaways

  • NaN loss usually comes from invalid inputs or unstable training like exploding gradients.
  • Add small epsilon values before log or division operations to avoid undefined math.
  • Check your data for NaNs and Infs before training using torch.isnan() and torch.isinf().
  • Use gradient clipping to keep training stable and prevent exploding gradients.
  • Monitor loss values during training and stop early if NaNs appear.