Gradient clipping in PyTorch - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
Why use gradient clipping in training?

Imagine you are training a neural network and notice the training loss suddenly spikes or the model weights become very large. Why would applying gradient clipping help in this situation?

A. It prevents gradients from becoming too large, avoiding unstable updates and exploding gradients.
B. It reduces the model size by pruning neurons with small gradients.
C. It increases the learning rate automatically to speed up training.
D. It normalizes the input data to have zero mean and unit variance.
💡 Hint

Think about what happens when gradients become very large during backpropagation.

Predict Output
intermediate
Output of gradient clipping code snippet

What will be the value of clipped_norm after running this PyTorch code?

PyTorch
import torch
from torch.nn.utils import clip_grad_norm_

model_params = [torch.nn.Parameter(torch.tensor([3.0, 4.0], requires_grad=True))]
for p in model_params:
    p.grad = torch.tensor([6.0, 8.0])

clipped_norm = clip_grad_norm_(model_params, max_norm=5.0)
print(round(clipped_norm.item(), 2))
A. 14.0
B. 5.0
C. 1.0
D. 10.0
💡 Hint

Calculate the norm of the original gradients before clipping.

Model Choice
advanced
Choosing when to apply gradient clipping

You are training two models: a shallow feedforward network and a deep recurrent neural network (RNN). Which model benefits more from gradient clipping and why?

A. The deep RNN, because it has many layers and is prone to exploding gradients during backpropagation through time.
B. The shallow feedforward network, because it has fewer layers and gradients can explode easily.
C. Both models benefit equally from gradient clipping regardless of architecture.
D. Neither model benefits from gradient clipping; it is only useful for convolutional networks.
💡 Hint

Consider which model type is more likely to have exploding gradients.

Hyperparameter
advanced
Effect of max_norm value in gradient clipping

In PyTorch's clip_grad_norm_, what happens if you set max_norm to a very small value like 0.1 during training?

A. Gradients will be amplified to speed up training.
B. Gradients will be ignored and training will proceed without updates.
C. Gradients will be scaled down heavily, possibly slowing or stopping learning.
D. The model will automatically increase the learning rate to compensate.
💡 Hint

Think about what happens when gradients are clipped to a very small norm.

🔧 Debug
expert
Identifying error in gradient clipping usage

Consider this PyTorch training loop snippet. Does it raise an error when run, and if so, which one and why?

PyTorch
import torch
from torch.nn.utils import clip_grad_norm_

model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.tensor([[1.0, 2.0]])
targets = torch.tensor([[1.0]])

optimizer.zero_grad()
outputs = model(inputs)
loss = torch.nn.functional.mse_loss(outputs, targets)
loss.backward()

clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

# Next iteration without zero_grad
outputs = model(inputs)
loss = torch.nn.functional.mse_loss(outputs, targets)
loss.backward()
clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
A. ValueError: max_norm must be positive.
B. No error; code runs fine.
C. TypeError: clip_grad_norm_ expects a list of tensors.
D. RuntimeError: Trying to backward through the graph a second time without retaining it.
💡 Hint

Look at the second backward call: it follows a new forward pass, but the gradients from the first iteration were not cleared.

Practice

1. What is the main purpose of gradient clipping in PyTorch training?
easy
A. To prevent gradients from becoming too large and destabilizing training
B. To increase the learning rate automatically during training
C. To save memory by reducing model size
D. To initialize model weights before training

Solution

  1. Step 1: Understand gradient behavior during training

    Gradients can sometimes become very large, causing unstable updates and training divergence.
  2. Step 2: Role of gradient clipping

    Gradient clipping caps the size of the gradients before the optimizer step, so exploding gradients cannot produce oversized, destabilizing updates.
  3. Final Answer:

    To prevent gradients from becoming too large and destabilizing training -> Option A
  4. Quick Check:

    Gradient clipping = prevent large gradients [OK]
Hint: Gradient clipping stops gradients from exploding during training [OK]
Common Mistakes:
  • Thinking it changes learning rate
  • Confusing with weight initialization
  • Believing it reduces model size
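To make "limiting the size of gradients" concrete, here is a minimal plain-Python sketch of norm-based rescaling. The helper name `clip_by_norm` is hypothetical, and the small `eps` guard against division by zero is an assumption for illustration; this is not PyTorch's implementation.

```python
import math

def clip_by_norm(grads, max_norm, eps=1e-6):
    """Hypothetical helper: rescale a flat list of gradient values so that
    their L2 norm does not exceed max_norm. Returns (clipped, original_norm)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The scale factor is capped at 1, so small gradients are left unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# A gradient with norm 50 is shrunk to (just under) norm 5; its direction is kept.
clipped, norm_before = clip_by_norm([30.0, 40.0], max_norm=5.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

Note that the gradient is only ever scaled down, never up, and the learning rate is not touched.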
2. Which PyTorch function is used to clip gradients by their norm?
easy
A. torch.optim.clip_grad_norm
B. torch.nn.utils.clip_grad_value_
C. torch.nn.utils.clip_grad_norm_
D. torch.clip_gradients

Solution

  1. Step 1: Recall PyTorch gradient clipping functions

    PyTorch provides two main functions: clip_grad_norm_ and clip_grad_value_ in torch.nn.utils.
  2. Step 2: Identify function for norm clipping

    clip_grad_norm_ rescales all gradients together so that their total norm does not exceed max_norm, while clip_grad_value_ clamps each gradient element to the range [-clip_value, clip_value].
  3. Final Answer:

    torch.nn.utils.clip_grad_norm_ -> Option C
  4. Quick Check:

    Norm clipping function = clip_grad_norm_ [OK]
Hint: clip_grad_norm_ clips total gradient size by norm [OK]
Common Mistakes:
  • Using clip_grad_value_ for norm clipping
  • Assuming optimizer has clipping functions
  • Using non-existent torch.clip_gradients
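The difference between the two functions is easier to see side by side. The sketch below applies each one to the same hand-set gradient; the parameter names `p_norm` and `p_value` and the gradient values are made up for illustration.

```python
import torch
from torch.nn.utils import clip_grad_norm_, clip_grad_value_

# Two parameters with identical hand-set gradients (illustrative values only).
p_norm = torch.nn.Parameter(torch.zeros(2))
p_norm.grad = torch.tensor([3.0, -4.0])
p_value = torch.nn.Parameter(torch.zeros(2))
p_value.grad = torch.tensor([3.0, -4.0])

# Norm clipping: rescales the whole gradient vector, keeping its direction.
clip_grad_norm_([p_norm], max_norm=1.0)
# Value clipping: clamps each element independently to [-1.0, 1.0].
clip_grad_value_([p_value], clip_value=1.0)

print(p_norm.grad)   # approximately [0.6, -0.8]
print(p_value.grad)  # [1.0, -1.0]
```

Norm clipping preserves the ratio between the components, whereas value clipping can change the gradient's direction.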
3. What will be the output of the following code snippet?
import torch
from torch.nn.utils import clip_grad_norm_

model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.tensor([[1.0, 2.0]])
target = torch.tensor([[1.0]])

optimizer.zero_grad()
output = model(inputs)
loss = (output - target).pow(2).mean()
loss.backward()
clip_grad_norm_(model.parameters(), max_norm=0.1)
for p in model.parameters():
    print(p.grad.norm().item())
medium
A. Code will raise an error because clip_grad_norm_ is called before backward()
B. Gradient norms will be unchanged and possibly larger than 0.1
C. Gradients will be zero because of clipping
D. All printed gradient norms will be less than or equal to 0.1

Solution

  1. Step 1: Understand code flow and gradient clipping

    Gradients are computed by loss.backward(), then clipped by clip_grad_norm_ with max_norm=0.1.
  2. Step 2: Effect of clip_grad_norm_ on gradients

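    clip_grad_norm_ rescales the gradients so that their total norm, taken over the weight and the bias together, does not exceed 0.1. Each per-parameter norm printed by the loop is at most that total, so every printed value is ≤ 0.1.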
  3. Final Answer:

    All printed gradient norms will be less than or equal to 0.1 -> Option D
  4. Quick Check:

    clip_grad_norm_ limits gradient norm ≤ max_norm [OK]
Hint: clip_grad_norm_ rescales gradients after backward [OK]
Common Mistakes:
  • Calling clip_grad_norm_ before backward()
  • Expecting gradients to be zero after clipping
  • Thinking clipping increases gradient norms
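One way to check this reasoning is to compare the total norm before and after clipping. The sketch below is a variation on the snippet above, not the same code: the seed and the target value of 5.0 are assumptions, chosen so that the unclipped gradient norm is well above 0.1.

```python
import torch
from torch.nn.utils import clip_grad_norm_

torch.manual_seed(0)  # arbitrary seed, only for repeatability
model = torch.nn.Linear(2, 1)
inputs = torch.tensor([[1.0, 2.0]])
target = torch.tensor([[5.0]])  # assumed target, far from any initial output

loss = (model(inputs) - target).pow(2).mean()
loss.backward()

# clip_grad_norm_ returns the total norm measured before clipping.
norm_before = clip_grad_norm_(model.parameters(), max_norm=0.1)

# Rebuild the total norm from the per-parameter norms after clipping.
per_param = [p.grad.norm() for p in model.parameters()]
norm_after = torch.stack(per_param).norm()

print(f"before: {norm_before.item():.4f}, after: {norm_after.item():.4f}")
```

The return value reports the pre-clipping norm, which makes it handy for logging how often clipping is active.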
4. Identify the error in this PyTorch training snippet using gradient clipping:
optimizer.zero_grad()
output = model(data)
loss = criterion(output, target)
clip_grad_norm_(model.parameters(), max_norm=1.0)
loss.backward()
optimizer.step()
medium
A. clip_grad_norm_ should be called after optimizer.step()
B. clip_grad_norm_ is called before loss.backward(), so gradients are not clipped
C. max_norm should be set to 0, not 1.0
D. optimizer.zero_grad() should be called after optimizer.step()

Solution

  1. Step 1: Check order of operations for gradient clipping

    Gradients are created by loss.backward(), so clipping must happen after backward() to affect gradients.
  2. Step 2: Identify mistake in code order

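    clip_grad_norm_ is called before loss.backward(), so at that point there are no freshly computed gradients to clip (after zero_grad() they are unset or zero). The gradients that backward() then produces reach optimizer.step() unclipped.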
  3. Final Answer:

    clip_grad_norm_ is called before loss.backward(), so gradients are not clipped -> Option B
  4. Quick Check:

    Clip gradients after backward() [OK]
Hint: Always clip gradients after backward(), before optimizer.step() [OK]
Common Mistakes:
  • Clipping before backward()
  • Calling zero_grad() after step()
  • Setting max_norm to zero
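For comparison, a corrected ordering of the same step might look like the sketch below. The model, criterion, optimizer, and random data are toy stand-ins assumed for illustration, since the original snippet does not define them.

```python
import torch
from torch.nn.utils import clip_grad_norm_

# Toy stand-ins for the snippet's model, criterion, optimizer, data, and target.
model = torch.nn.Linear(3, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data = torch.randn(8, 3)
target = torch.randn(8, 1)

optimizer.zero_grad()               # 1. clear old gradients
output = model(data)                # 2. forward pass
loss = criterion(output, target)
loss.backward()                     # 3. compute gradients
total_norm = clip_grad_norm_(model.parameters(), max_norm=1.0)  # 4. clip them
optimizer.step()                    # 5. update weights with clipped gradients
```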
5. You want to prevent exploding gradients in a deep RNN model. Which approach correctly applies gradient clipping in PyTorch during training?
hard
A. After loss.backward(), call torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5), then optimizer.step()
B. Before loss.backward(), call torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5), then optimizer.step()
C. After optimizer.step(), call torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)
D. Call torch.nn.utils.clip_grad_value_(model.parameters(), max_norm=5) before loss.backward()

Solution

  1. Step 1: Understand correct gradient clipping sequence

    Gradients are computed by loss.backward(), so clipping must happen after this step and before optimizer.step().
  2. Step 2: Identify correct function and order

    clip_grad_norm_ is used to clip by norm, suitable for RNNs to prevent exploding gradients. It must be called after backward() and before optimizer.step().
  3. Final Answer:

    After loss.backward(), call torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5), then optimizer.step() -> Option A
  4. Quick Check:

    Clip gradients after backward(), before step() [OK]
Hint: Clip gradients after backward(), before optimizer step [OK]
Common Mistakes:
  • Clipping before backward()
  • Clipping after optimizer.step()
  • Using clip_grad_value_ incorrectly
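The ordering in Option A can be sketched as a short training loop. Everything concrete here is an assumption for illustration: the two-layer `torch.nn.RNN`, the linear `head`, the tensor shapes, the learning rate, and the random data.

```python
import torch
from torch.nn.utils import clip_grad_norm_

torch.manual_seed(0)  # arbitrary seed, only for repeatability
rnn = torch.nn.RNN(input_size=4, hidden_size=8, num_layers=2, batch_first=True)
head = torch.nn.Linear(8, 1)
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)

x = torch.randn(16, 20, 4)  # (batch, time steps, features), random toy data
y = torch.randn(16, 1)

for step in range(3):
    optimizer.zero_grad()
    out, _ = rnn(x)                  # out: (batch, time steps, hidden)
    pred = head(out[:, -1, :])       # predict from the last time step
    loss = torch.nn.functional.mse_loss(pred, y)
    loss.backward()
    # Clip after backward() and before step(); returns the pre-clipping norm.
    grad_norm = clip_grad_norm_(params, max_norm=5)
    optimizer.step()
    print(f"step {step}: loss={loss.item():.4f}, grad norm={grad_norm.item():.4f}")
```

Passing one combined parameter list means the norm is computed over the RNN and the head together, so a single max_norm bounds the whole update.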