PyTorch · ML · ~15 mins

Optimizers (SGD, Adam) in PyTorch - Deep Dive

Overview - Optimizers (SGD, Adam)
What is it?
Optimizers are tools that help a machine learning model learn by adjusting its settings to make better predictions. They decide how to change the model's internal numbers step-by-step to reduce mistakes. Two popular optimizers are SGD (Stochastic Gradient Descent) and Adam, each with different ways to update these numbers. Optimizers are essential for training models efficiently and accurately.
Why it matters
Without optimizers, models would not know how to improve from their errors, making learning impossible or extremely slow. Optimizers solve the problem of finding the best settings for a model to perform well on new data. This impacts everything from voice assistants to medical diagnosis tools, making AI smarter and more reliable.
Where it fits
Before learning optimizers, you should understand what a model is and how it makes predictions, especially the concept of loss or error. After optimizers, learners usually study learning rate schedules, regularization, and advanced training techniques to improve model performance further.
Mental Model
Core Idea
An optimizer is like a smart guide that tells the model how to change its settings step-by-step to make fewer mistakes.
Think of it like...
Imagine climbing down a mountain blindfolded to reach the lowest point. SGD takes small steps based on the slope under your feet, while Adam remembers past slopes to choose better steps faster.
Model Parameters
    ↓
Calculate Loss (Error)
    ↓
Compute Gradients (Slopes)
    ↓
Optimizer Updates Parameters
    ↓
Repeat until loss is low

[SGD] uses current slope only
[Adam] uses current + past slopes
Build-Up - 7 Steps
1
Foundation - What is an optimizer in ML
🤔
Concept: Introduces the basic role of an optimizer in machine learning.
In machine learning, a model tries to make predictions. It measures how wrong it is using a loss function. The optimizer changes the model's settings (parameters) to reduce this loss. Think of it as a helper that nudges the model to improve.
Result
The model's parameters start changing to reduce errors.
Understanding that optimizers guide the model's learning process is key to grasping how training works.
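The loop described above can be sketched in PyTorch with a hypothetical one-parameter model (the data and target values here are made up for illustration):

```python
import torch

# Hypothetical one-parameter "model": predict y = w * x (true w is 3).
w = torch.tensor([0.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

x = torch.tensor([2.0])
y_true = torch.tensor([6.0])

for _ in range(100):
    optimizer.zero_grad()                  # clear gradients from the last step
    loss = ((w * x - y_true) ** 2).mean()  # measure how wrong the model is
    loss.backward()                        # compute gradient of loss w.r.t. w
    optimizer.step()                       # optimizer nudges w to reduce loss

final_w = w.item()  # ends up close to 3
```

The four-call pattern (zero_grad, loss, backward, step) is the same no matter which optimizer is plugged in.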
2
Foundation - Gradient descent basics
🤔
Concept: Explains how gradients show the direction to improve the model.
Gradient descent is a method where the model looks at the slope (gradient) of the loss to know which way to change parameters. It moves parameters opposite to the slope to reduce loss step-by-step.
Result
Parameters move closer to values that reduce error.
Knowing gradients point the way to better parameters helps understand why optimizers use them.
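A minimal, framework-free sketch of this idea: gradient descent on the made-up loss f(x) = (x - 5)², whose gradient is 2(x - 5):

```python
# Gradient descent by hand on f(x) = (x - 5)**2; its gradient is 2 * (x - 5).
x = 0.0
lr = 0.1  # step size

for _ in range(200):
    grad = 2 * (x - 5)  # slope of the loss at the current x
    x -= lr * grad      # move opposite the slope

# x has moved from 0 to (nearly) the minimum at 5
```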
3
Intermediate - Stochastic Gradient Descent (SGD)
🤔 Before reading on: Do you think SGD uses all data at once or small parts to update parameters? Commit to your answer.
Concept: Introduces SGD as a version of gradient descent using small data batches.
SGD updates model parameters using gradients from small random samples (batches) of data instead of the whole dataset. This makes updates faster and adds some randomness that can help escape bad solutions.
Result
Model learns faster with noisy but frequent updates.
Understanding SGD's use of small batches explains why training can be faster and sometimes more effective.
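In PyTorch, mini-batch SGD is typically torch.optim.SGD paired with a DataLoader; a small sketch on a made-up linear dataset (sizes and rates are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # make the batch shuffling reproducible

# Made-up regression data: 64 points on the line y = 2x.
X = torch.linspace(-1, 1, 64).unsqueeze(1)
Y = 2 * X
loader = DataLoader(TensorDataset(X, Y), batch_size=8, shuffle=True)

model = torch.nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

for epoch in range(50):
    for xb, yb in loader:             # one update per mini-batch, not per epoch
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()

learned_slope = model.weight.item()   # close to the true slope of 2
```

With batch_size=8 over 64 points, each epoch performs 8 noisy updates instead of one exact full-dataset update.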
4
Intermediate - Adam optimizer basics
🤔 Before reading on: Do you think Adam treats all gradients equally or remembers past gradients? Commit to your answer.
Concept: Explains Adam optimizer's use of past gradient information to improve updates.
Adam keeps track of past gradients and their squares to adjust the step size for each parameter individually. This helps the model learn faster and more reliably, especially on complex problems.
Result
Parameters update adaptively, often leading to quicker convergence.
Knowing Adam adapts learning rates per parameter clarifies why it often outperforms simpler methods.
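Switching to Adam is a one-line change in PyTorch. A one-parameter toy (predict y = w·x with true w = 3; the values are made up), now optimized with torch.optim.Adam:

```python
import torch

# Hypothetical one-parameter model: predict y = w * x (true w is 3).
w = torch.tensor([0.0], requires_grad=True)
optimizer = torch.optim.Adam([w], lr=0.1)  # adaptive per-parameter step sizes

x = torch.tensor([2.0])
y_true = torch.tensor([6.0])

for _ in range(1000):
    optimizer.zero_grad()
    loss = ((w * x - y_true) ** 2).mean()
    loss.backward()
    optimizer.step()

final_w = w.item()  # settles near 3
```

Internally, Adam is maintaining moving averages of gradients for w; the training loop itself is unchanged.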
5
Intermediate - Learning rate and its impact
🤔 Before reading on: Is a higher learning rate always better for faster learning? Commit to your answer.
Concept: Discusses the importance of the learning rate in optimizer performance.
The learning rate controls how big each parameter update is. Too big can cause the model to jump around and not learn well; too small makes learning slow. Both SGD and Adam need a good learning rate to work well.
Result
Proper learning rate leads to stable and efficient training.
Understanding learning rate effects helps avoid common training failures.
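A tiny framework-free experiment makes the effect concrete: gradient descent on f(x) = x² (gradient 2x) with a safe versus an oversized learning rate:

```python
# Gradient descent on f(x) = x**2 with two learning rates.
def final_distance(lr, steps=20):
    x = 1.0
    for _ in range(steps):
        x -= lr * 2 * x  # each step multiplies x by (1 - 2 * lr)
    return abs(x)

good = final_distance(0.1)     # |x| shrinks by 0.8 each step -> near zero
too_big = final_distance(1.1)  # |x| grows by 1.2 each step -> diverges
```

With lr = 0.1 the iterate contracts toward the minimum; with lr = 1.1 each update overshoots so badly that the error grows exponentially.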
6
Advanced - Adam's bias correction explained
🤔 Before reading on: Do you think Adam's moment estimates start unbiased or biased? Commit to your answer.
Concept: Explains why Adam uses bias correction for its moving averages.
Adam calculates moving averages of gradients but these averages start biased towards zero at the beginning. To fix this, Adam applies bias correction to get accurate estimates, improving early training steps.
Result
More reliable parameter updates especially at training start.
Knowing about bias correction reveals why Adam is stable and effective from the first step.
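The bias and its correction are visible in a single update (the gradient value is made up for illustration):

```python
# First Adam update of the gradient moving average, with beta1 = 0.9.
beta1 = 0.9
grad = 1.0  # suppose the true gradient is steadily 1.0

m = 0.0                             # moving average starts at zero
m = beta1 * m + (1 - beta1) * grad  # after step 1: m = 0.1, far below 1.0
m_hat = m / (1 - beta1 ** 1)        # bias-corrected: m_hat = 1.0
```

Because m starts at zero, the raw average reports only 10% of the true gradient on step one; dividing by (1 - beta1^t) rescales it to the correct value, and the correction factor fades to 1 as t grows.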
7
Expert - When Adam can fail and SGD shines
🤔 Before reading on: Do you think Adam always outperforms SGD? Commit to your answer.
Concept: Discusses scenarios where SGD may outperform Adam despite Adam's advantages.
Though Adam adapts quickly, it can sometimes lead to worse final results or overfitting. SGD with momentum often generalizes better on some tasks, such as image recognition. Experts sometimes switch from Adam to SGD partway through training.
Result
Choosing the right optimizer or switching can improve final model quality.
Understanding optimizer strengths and weaknesses guides better training strategies.
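Switching optimizers mid-training is just rebuilding the optimizer object; a sketch (the epoch counts and learning rates are placeholders):

```python
import torch

model = torch.nn.Linear(4, 1)

# Phase 1: Adam for fast early progress.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# ... train for some epochs with Adam ...

# Phase 2: rebuild the optimizer as SGD with momentum for fine-tuning.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
# ... continue training; note Adam's moment buffers are discarded here ...
```

One design consequence: the new SGD optimizer starts with empty state, so any momentum built up under Adam is lost at the switch.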
Under the Hood
Optimizers work by calculating gradients of the loss with respect to each parameter, then adjusting parameters to reduce loss. SGD uses the current gradient from a mini-batch to update parameters directly. Adam maintains moving averages of past gradients (first moment) and squared gradients (second moment) to adaptively scale updates per parameter. It also applies bias correction to these averages to avoid initial bias. These calculations happen at each training step, enabling the model to gradually improve.
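As a sanity check on this description, a from-scratch Adam update can be compared against torch.optim.Adam on a one-parameter quadratic (the loss f(w) = (w - 3)², with gradient 2(w - 3), is made up for the test):

```python
import math
import torch

lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

# Manual implementation of the update rule described above.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 11):
    g = 2 * (w - 3)
    m = beta1 * m + (1 - beta1) * g      # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * g * g  # second moment (mean of squared grads)
    m_hat = m / (1 - beta1 ** t)         # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)

# The same ten steps via PyTorch's optimizer.
wt = torch.tensor([0.0], requires_grad=True)
opt = torch.optim.Adam([wt], lr=lr, betas=(beta1, beta2), eps=eps)
for _ in range(10):
    opt.zero_grad()
    loss = ((wt - 3) ** 2).sum()
    loss.backward()
    opt.step()

# w and wt.item() agree to floating-point precision
```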
Why designed this way?
SGD was designed for efficiency by using small data batches to speed up training and introduce helpful noise. Adam was created to combine the benefits of momentum and adaptive learning rates, addressing SGD's sensitivity to learning rate tuning. Alternatives like RMSProp or Adagrad existed but Adam's bias correction and combined moments made it more robust and widely adopted.
┌───────────────┐
│   Data Batch  │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│  Compute Loss │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ Compute Grads │
└──────┬────────┘
       │
       ▼
┌───────────────┐          ┌───────────────┐
│   SGD Update  │─────────▶│ Update Params │
└───────────────┘          └───────────────┘

    -- or --

┌───────────────┐          ┌───────────────┐
│  Adam Update  │─────────▶│ Update Params │
│  (moments +   │          └───────────────┘
│ bias correct) │
└───────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does a higher learning rate always speed up training without issues? Commit to yes or no.
Common Belief: A higher learning rate always makes the model learn faster and better.
Reality: Too high a learning rate can cause the model to overshoot minima, making training unstable or causing it to fail.
Why it matters: Too high a learning rate wastes time and compute and can prevent the model from learning at all.
Quick: Does Adam always produce better final models than SGD? Commit to yes or no.
Common Belief: Adam is always better than SGD because it adapts learning rates automatically.
Reality: Adam often learns faster, but SGD with momentum can achieve better final accuracy and generalization on some tasks.
Why it matters: Blindly choosing Adam may lead to suboptimal models; knowing when to use SGD improves results.
Quick: Does SGD use the entire dataset to compute gradients every step? Commit to yes or no.
Common Belief: SGD calculates gradients using the whole dataset each time it updates parameters.
Reality: SGD uses small random batches (mini-batches), not the full dataset, to compute gradients each step.
Why it matters: Expecting full-dataset gradients gives the wrong intuition about training speed, memory use, and the noise visible in loss curves.
Quick: Does Adam's moving averages start unbiased from the first step? Commit to yes or no.
Common Belief: Adam's moving averages are accurate from the very first update.
Reality: They start biased towards zero and require bias correction for accurate updates early in training.
Why it matters: Ignoring bias correction can cause poor early-training behavior and slower convergence.
Expert Zone
1
Adam's adaptive learning rates can sometimes cause it to converge to sharp minima, which may generalize worse than SGD's flatter minima.
2
SGD with momentum can be seen as a low-pass filter smoothing gradients, which helps escape noisy updates better than plain SGD.
3
The choice of optimizer interacts with batch size and learning rate schedules, affecting training dynamics in subtle ways.
When NOT to use
Adam may not be ideal for very large datasets or when final model generalization is critical; in such cases, SGD with momentum or newer optimizers like AdamW or Ranger might be better. SGD is less suitable for sparse data or noisy gradients where adaptive methods excel.
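Where decoupled weight decay is wanted, AdamW is available directly in PyTorch; a one-liner sketch (the lr and weight_decay values are illustrative):

```python
import torch

model = torch.nn.Linear(16, 4)

# AdamW decouples weight decay from the adaptive update, unlike the
# L2-style penalty that plain Adam applies through the gradient.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```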
Production Patterns
In practice, many teams start training with Adam for fast convergence, then switch to SGD with momentum for fine-tuning. Learning rate warm-up and decay schedules are combined with optimizers to stabilize training. Custom optimizers or hybrid approaches are used for specialized tasks like NLP or reinforcement learning.
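A sketch of warm-up plus decay using PyTorch's built-in schedulers (the step counts and rates are placeholders, and gradient computation is omitted to keep the sketch short):

```python
import torch

model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Linear warm-up over the first 5 steps, then cosine decay for the rest.
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.1, total_iters=5
)
decay = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=95)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, decay], milestones=[5]
)

lrs = []
for _ in range(100):
    optimizer.step()      # (backward pass omitted in this sketch)
    scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])

# lrs rises during warm-up, peaks near 0.1, then decays toward zero
```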
Connections
Control Systems
Both use feedback loops to adjust system behavior toward a goal.
Understanding optimizers as feedback controllers helps grasp how they correct model errors iteratively.
Economics - Gradient-based Optimization in Resource Allocation
Both optimize a function by adjusting variables to maximize or minimize outcomes.
Seeing optimizers as economic decision-makers clarifies their role in balancing trade-offs during learning.
Human Learning and Skill Improvement
Both involve iterative adjustments based on feedback to improve performance.
Recognizing that optimizers mimic how humans learn from mistakes deepens intuition about training dynamics.
Common Pitfalls
#1 Using a fixed high learning rate, causing training to diverge.
Wrong approach:
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
for epoch in range(10):
    optimizer.zero_grad()
    loss = compute_loss()
    loss.backward()
    optimizer.step()
Correct approach:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for epoch in range(10):
    optimizer.zero_grad()
    loss = compute_loss()
    loss.backward()
    optimizer.step()
Root cause: Not understanding that too large a learning rate causes unstable updates and prevents convergence.
#2 Confusing SGD with full-batch gradient descent and expecting slow updates.
Wrong approach: using full-dataset gradients every step:
for epoch in range(10):
    optimizer.zero_grad()
    loss = compute_loss(full_dataset)
    loss.backward()
    optimizer.step()
Correct approach: using mini-batches for SGD:
for epoch in range(10):
    for batch in data_loader:
        optimizer.zero_grad()
        loss = compute_loss(batch)
        loss.backward()
        optimizer.step()
Root cause: Not realizing SGD uses mini-batches, which speeds up training and adds helpful noise.
#3 Ignoring Adam's bias correction, leading to poor early training.
Wrong approach: implementing Adam without bias correction:
# moving averages updated but no bias fix applied
m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad**2
param -= lr * m / (sqrt(v) + eps)
Correct approach: implementing Adam with bias correction:
m_hat = m / (1 - beta1**t)
v_hat = v / (1 - beta2**t)
param -= lr * m_hat / (sqrt(v_hat) + eps)
Root cause: Overlooking the need to correct the initial bias in the moving averages causes inaccurate updates early on.
Key Takeaways
Optimizers guide machine learning models to improve by adjusting parameters to reduce errors step-by-step.
SGD updates parameters using gradients from small random batches, making training faster and introducing helpful noise.
Adam improves on SGD by adapting learning rates per parameter using past gradient information and bias correction.
Choosing the right optimizer and tuning learning rates critically affects training speed and final model quality.
Understanding optimizer mechanics and limitations enables better training strategies and improved model performance.