PyTorch · ~15 mins

Gradient access (.grad) in PyTorch - Deep Dive

Overview - Gradient access (.grad)
What is it?
Gradient access (.grad) in PyTorch lets you see the gradients of tensors after a backward pass. Gradients are numbers that tell us how much a change in one value affects the final output. They are essential for training machine learning models by adjusting parameters to reduce errors. The .grad attribute stores these gradients for tensors that require them.
Why it matters
Without access to gradients, models cannot learn from data because they wouldn't know how to improve. Gradients guide the model to make better predictions by showing the direction and size of changes needed. This makes .grad crucial for training neural networks and other machine learning models effectively.
Where it fits
Before learning about .grad, you should understand tensors and automatic differentiation in PyTorch. After mastering .grad, you can explore optimization algorithms that use gradients to update model parameters, like SGD or Adam.
Mental Model
Core Idea
The .grad attribute holds the calculated gradients that show how each tensor affects the final result, enabling learning through adjustment.
Think of it like...
Imagine climbing a hill blindfolded. The gradient is like feeling the slope under your feet, telling you which way to step to go downhill and reach the lowest point.
Tensor (value) ──> Compute graph ──> Backward pass ──> Gradient stored in .grad

┌───────────┐       ┌───────────────┐       ┌────────────────┐
│ Tensor x  │──────▶│ Computation   │──────▶│ Backpropagation│
│ (value)   │       │ graph         │       │ calculates     │
│ .grad     │◀──────│               │◀──────│ gradients      │
└───────────┘       └───────────────┘       └────────────────┘
Build-Up - 7 Steps
1
Foundation: Understanding tensors and requires_grad
🤔
Concept: Tensors are the basic data structure in PyTorch, and requires_grad tells PyTorch to track operations for gradient calculation.
In PyTorch, a tensor is like a container for numbers. When you create a tensor with requires_grad=True, PyTorch remembers all operations on it to compute gradients later. For example:

import torch
x = torch.tensor([2.0, 3.0], requires_grad=True)

This means x will have gradients after backward() is called.
Result
The tensor x is ready to track operations and will store gradients in x.grad after backward.
Knowing that requires_grad is the switch that turns on gradient tracking is key to using .grad effectively.
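The switch can be seen in action with a minimal sketch (plain PyTorch, nothing beyond the calls shown above):

```python
import torch

# Leaf tensor with gradient tracking enabled
x = torch.tensor([2.0, 3.0], requires_grad=True)
print(x.requires_grad)  # True
print(x.grad)           # None -- no backward pass has run yet

# Without requires_grad, PyTorch records no operations on the tensor
z = torch.tensor([2.0, 3.0])
print(z.requires_grad)  # False
```

Note that even with tracking on, .grad stays None until a backward pass actually runs.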
2
Foundation: Performing backward to compute gradients
🤔
Concept: Backward pass computes gradients of a scalar output with respect to tensors that require gradients.
After computing a scalar output from tensors, calling backward() calculates gradients. For example:

import torch
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = x.pow(2).sum()  # y = 2^2 + 3^2 = 13
y.backward()        # computes dy/dx

Now, x.grad holds the gradients.
Result
x.grad will be tensor([4.0, 6.0]) because dy/dx = 2*x.
Backward is the trigger that fills .grad with meaningful values after a computation.
3
Intermediate: Accessing and interpreting .grad values
🤔 Before reading on: do you think .grad holds gradients immediately after tensor creation or only after backward is called? Commit to your answer.
Concept: .grad stores gradients only after backward is called; before that, it is None.
The .grad attribute is None until backward() runs. Afterward, it contains the gradient tensor. For example:

print(x.grad)  # None before backward
y.backward()
print(x.grad)  # tensor([4., 6.]) after backward

Gradients show how much the output changes if you change each element of x.
Result
You see None before backward and actual gradient values after backward.
Understanding when .grad is populated prevents confusion about missing gradients.
4
Intermediate: Clearing gradients to avoid accumulation
🤔 Before reading on: do you think gradients in .grad reset automatically after each backward call or accumulate? Commit to your answer.
Concept: Gradients accumulate in .grad by default; you must clear them manually to avoid incorrect updates.
PyTorch adds new gradients to the existing ones in .grad. This means that if you call backward() multiple times without clearing, the gradients add up. To reset them, use:

x.grad.zero_()

This is important in training loops to avoid mixing gradients from different steps.
Result
Gradients are reset to zero, preventing unwanted accumulation.
Knowing gradient accumulation behavior is crucial for correct model training.
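The accumulation behavior is easy to verify with a short sketch; each backward pass adds dy/dx = 2*x onto whatever is already in .grad:

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)

# First backward pass: dy/dx = 2*x = [4, 6]
x.pow(2).sum().backward()
first = x.grad.clone()
print(first)   # tensor([4., 6.])

# Second backward WITHOUT clearing: new gradients are added on top
x.pow(2).sum().backward()
second = x.grad.clone()
print(second)  # tensor([8., 12.]) -- accumulated, not replaced

# Reset in place before the next step
x.grad.zero_()
print(x.grad)  # tensor([0., 0.])
```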
5
Intermediate: Handling non-scalar outputs in backward
🤔 Before reading on: do you think backward() can be called directly on tensors with more than one element? Commit to your answer.
Concept: Backward requires a scalar output; for non-scalar tensors, you must provide a gradient argument matching the tensor's shape.
If the output is not a single number, backward() needs a gradient argument to start backpropagation. For example:

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x * 2  # y is a vector
y.backward(torch.tensor([1.0, 1.0]))  # gradient argument

This tells PyTorch how to weight each element during the backward pass.
Result
Gradients are computed correctly for each element of x.
Understanding this prevents errors when working with vector or matrix outputs.
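Run end to end, the vector example looks like this; a vector of ones recovers the plain element-wise derivative dy/dx = 2:

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x * 2  # vector output: backward() alone would raise an error

# The gradient argument weights each output element during backprop;
# ones give the unweighted element-wise derivative
y.backward(torch.tensor([1.0, 1.0]))
print(x.grad)  # tensor([2., 2.])
```

Passing weights w this way is equivalent to calling (y * w).sum().backward(), which is why a scalar output never needs the argument.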
6
Advanced: Accessing gradients of intermediate tensors
🤔 Before reading on: do you think intermediate tensors always keep their gradients by default? Commit to your answer.
Concept: Intermediate tensors do not keep gradients unless retain_grad() is called on them.
By default, only leaf tensors (created by the user with requires_grad=True) keep gradients. To access gradients of intermediate tensors in the computation graph, call retain_grad() on them before backward(). For example:

x = torch.tensor(2.0, requires_grad=True)
y = x * x
z = y * 3
y.retain_grad()
z.backward()
print(y.grad)  # now accessible

Without retain_grad(), y.grad is None.
Result
You can inspect gradients of intermediate steps in the graph.
Knowing how to keep intermediate gradients helps debug and understand complex models.
7
Expert: Surprising behavior with in-place operations and .grad
🤔 Before reading on: do you think in-place operations on tensors affect .grad safely or cause errors? Commit to your answer.
Concept: In-place operations can corrupt the computation graph and cause incorrect or missing gradients.
Modifying a tensor in-place (e.g., x += 1) after it requires gradients can break gradient calculation. On a leaf tensor, PyTorch raises a RuntimeError immediately; in other cases it may produce wrong .grad values. Prefer out-of-place operations, or clone tensors before modifying them. For example:

x = torch.tensor(2.0, requires_grad=True)
x += 1  # risky in-place op
y = x * x
y.backward()

This pattern causes runtime errors or wrong gradients.
Result
Avoiding in-place ops ensures correct gradient computation.
Understanding this prevents subtle bugs that are hard to debug in training.
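A short sketch shows both the failure and the safe out-of-place alternative:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# In-place ops on a leaf tensor that requires grad are rejected outright
try:
    x += 1
    raised = False
except RuntimeError as err:
    raised = True
    print("rejected:", err)

# Safe alternative: out-of-place ops create new tensors in the graph
y = (x + 1) * (x + 1)
y.backward()
print(x.grad)  # tensor(6.) -- d/dx (x+1)^2 = 2*(x+1) = 6 at x=2
```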
Under the Hood
PyTorch builds a computation graph dynamically as operations happen on tensors with requires_grad=True. When backward() is called on a scalar output, PyTorch traverses this graph in reverse, applying the chain rule to compute gradients for each tensor. These gradients are stored in the .grad attribute of leaf tensors. Intermediate tensors do not keep gradients unless retain_grad() is called. The graph tracks operations and dependencies to efficiently compute gradients.
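The graph described above is visible from Python: every non-leaf tensor carries a grad_fn recording the operation that created it, and backward() walks this chain in reverse.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x * x
z = y * 3

# Leaves have no grad_fn; intermediate tensors record their creating op
print(x.grad_fn)  # None -- x is a leaf
print(y.grad_fn)  # a MulBackward0 node

z.backward()      # traverses the grad_fn chain, applying the chain rule
print(x.grad)     # tensor(12.) -- dz/dx = 3 * 2x = 12 at x=2
```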
Why designed this way?
Dynamic computation graphs allow flexibility to build different models on the fly, unlike static graphs. Storing gradients in .grad keeps them accessible for optimization steps. The design balances ease of use, performance, and debugging. Alternatives like static graphs were less flexible and harder to debug, so PyTorch chose dynamic graphs with explicit gradient storage.
┌────────────────┐      ┌───────────────┐      ┌───────────────┐
│ Input tensor   │─────▶│ Computation   │─────▶│ Output scalar │
│ (requires_grad)│      │ graph built   │      │ backward()    │
└────────────────┘      └───────────────┘      └───────────────┘
         │                                         │
         │                                         ▼
         │                               ┌───────────────────┐
         │                               │ Backpropagation   │
         │                               │ computes gradients│
         │                               └───────────────────┘
         │                                         │
         ▼                                         ▼
┌───────────────┐                         ┌───────────────┐
│ .grad stores  │◀────────────────────────│ Leaf tensors  │
│ gradients     │                         │ gradients     │
└───────────────┘                         └───────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does .grad hold gradients immediately after tensor creation? Commit yes or no.
Common Belief: Many think .grad is populated as soon as a tensor is created with requires_grad=True.
Reality: .grad is None until backward() is called on a scalar output involving that tensor.
Why it matters: Expecting gradients too early leads to confusion and bugs when .grad is None.
Quick: Do gradients in .grad reset automatically after each backward call? Commit yes or no.
Common Belief: Some believe gradients reset automatically after each backward call.
Reality: Gradients accumulate in .grad by default and must be manually cleared with zero_().
Why it matters: Failing to clear gradients causes incorrect updates and training instability.
Quick: Can you call backward() directly on a tensor with multiple elements? Commit yes or no.
Common Belief: Many think backward() works on any tensor regardless of shape without extra arguments.
Reality: Backward requires a scalar output or a gradient argument matching the tensor's shape for non-scalars.
Why it matters: Calling backward incorrectly causes runtime errors and blocks training.
Quick: Do intermediate tensors keep gradients by default? Commit yes or no.
Common Belief: Some assume all tensors with requires_grad=True keep gradients automatically.
Reality: Only leaf tensors keep gradients by default; intermediate tensors need retain_grad() called.
Why it matters: Not knowing this leads to None gradients and confusion during debugging.
Expert Zone
1
Gradients accumulate in .grad, so forgetting to zero them can silently corrupt training results.
2
Intermediate tensors do not keep gradients unless retain_grad() is called, which is essential for debugging complex models.
3
In-place operations on tensors with requires_grad=True can break the computation graph, causing subtle bugs or runtime errors.
When NOT to use
Do not rely on .grad for tensors that do not require gradients or for non-leaf tensors without retain_grad(). For models requiring higher-order gradients, use torch.autograd.grad instead. For static graph frameworks, PyTorch's dynamic .grad approach is not applicable.
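For the higher-order case, torch.autograd.grad returns gradients directly instead of writing them into .grad; with create_graph=True the first derivative itself becomes differentiable. A minimal sketch:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3

# First derivative, keeping the graph so we can differentiate again
(dy,) = torch.autograd.grad(y, x, create_graph=True)
print(dy)   # 3*x^2 = 12 at x = 2

# Second derivative: d/dx (3*x^2) = 6*x
(d2y,) = torch.autograd.grad(dy, x)
print(d2y)  # 6*x = 12 at x = 2
```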
Production Patterns
In production training loops, .grad is accessed after backward() to update model parameters with optimizers. Gradients are zeroed each iteration to prevent accumulation. retain_grad() is used selectively for debugging. Avoid in-place ops on parameters to ensure stable gradient computation.
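The loop structure described above can be sketched on a toy problem; the model, data, and hyperparameters here are illustrative assumptions, not a recipe:

```python
import torch

torch.manual_seed(0)

# Toy task: fit y = 2*x with a one-feature linear layer
model = torch.nn.Linear(1, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
xs = torch.tensor([[1.0], [2.0], [3.0]])
ys = 2 * xs

for _ in range(500):
    opt.zero_grad()                        # clear accumulated .grad
    loss = ((model(xs) - ys) ** 2).mean()  # scalar loss
    loss.backward()                        # fills p.grad for each parameter
    opt.step()                             # reads p.grad to update p

print(model.weight.item())  # converges close to 2.0
```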
Connections
Chain rule in calculus
Gradient computation in .grad uses the chain rule to propagate derivatives backward through operations.
Understanding the chain rule clarifies why gradients flow backward and how complex functions are differentiated.
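The connection can be checked numerically: for a composite function, the value autograd puts in .grad matches the chain rule applied by hand.

```python
import torch

# z = sin(x^2): outer function sin(u), inner function u = x^2
x = torch.tensor(0.5, requires_grad=True)
z = torch.sin(x * x)
z.backward()

# Chain rule by hand: dz/dx = cos(x^2) * 2x
manual = torch.cos(x.detach() ** 2) * 2 * x.detach()
print(torch.allclose(x.grad, manual))  # True
```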
Optimization algorithms
Gradients stored in .grad are inputs to optimization algorithms like SGD or Adam that update model parameters.
Knowing how .grad connects to optimizers helps grasp the full training cycle from gradient calculation to parameter update.
Electric circuit analysis
Backpropagation and gradient flow resemble current flow in circuits where changes propagate through connected components.
This analogy helps appreciate how local changes affect the whole system, similar to gradients affecting model outputs.
Common Pitfalls
#1: Expecting .grad to have values before backward() is called.
Wrong approach:

import torch
x = torch.tensor([1.0, 2.0], requires_grad=True)
print(x.grad)  # expecting gradients here

Correct approach:

import torch
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x.sum()
y.backward()
print(x.grad)  # gradients available after backward
Root cause: Misunderstanding that gradients are computed only after backward() runs.
#2: Not clearing gradients before multiple backward passes, causing accumulation.
Wrong approach:

# optimizer.zero_grad() missing
loss.backward()
optimizer.step()
loss.backward()  # gradients accumulate incorrectly

Correct approach:

optimizer.zero_grad()
loss.backward()
optimizer.step()
optimizer.zero_grad()  # clear before the next backward
Root cause: Not knowing that .grad accumulates gradients by default.
#3: Calling backward() on a non-scalar tensor without a gradient argument.
Wrong approach:

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x * 2
y.backward()  # RuntimeError: grad can be implicitly created only for scalar outputs

Correct approach:

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x * 2
y.backward(torch.tensor([1.0, 1.0]))  # provide the gradient argument
Root cause: Not providing a gradient argument for non-scalar outputs.
Key Takeaways
The .grad attribute stores gradients only after backward() is called on a scalar output involving the tensor.
Gradients accumulate in .grad by default and must be manually cleared to avoid mixing updates.
Backward requires a scalar output or a matching gradient argument for non-scalar tensors to compute gradients.
Only leaf tensors keep gradients automatically; intermediate tensors need retain_grad() to store gradients.
In-place operations on tensors with requires_grad=True can break gradient computation and should be avoided.