PyTorch · ML · ~15 mins

Optimizers (SGD, Adam) in PyTorch - Deep Dive

Overview - Optimizers (SGD, Adam)
What is it?
Optimizers are tools that help a machine learning model learn by adjusting its settings to make better predictions. They decide how to change the model's internal numbers step-by-step to reduce mistakes. Two popular optimizers are SGD (Stochastic Gradient Descent) and Adam, each with different ways to update these numbers. Optimizers are essential for training models efficiently and accurately.
Why it matters
Without optimizers, models would not know how to improve from their errors, making learning impossible or extremely slow. Optimizers solve the problem of finding the best settings for a model to perform well on new data. This impacts everything from voice assistants to medical diagnosis tools, making AI smarter and more reliable.
Where it fits
Before learning optimizers, you should understand what a model is and how it makes predictions, especially the concept of loss or error. After optimizers, learners usually study learning rate schedules, regularization, and advanced training techniques to improve model performance further.
Mental Model
Core Idea
An optimizer is like a smart guide that tells the model how to change its settings step-by-step to make fewer mistakes.
Think of it like...
Imagine climbing down a mountain blindfolded to reach the lowest point. SGD takes small steps based on the slope under your feet, while Adam remembers past slopes to choose better steps faster.
Model Parameters
    ↓
Calculate Loss (Error)
    ↓
Compute Gradients (Slopes)
    ↓
Optimizer Updates Parameters
    ↓
Repeat until loss is low

[SGD] uses current slope only
[Adam] uses current + past slopes
Build-Up - 7 Steps
1
Foundation - What is an optimizer in ML
🤔
Concept: Introduces the basic role of an optimizer in machine learning.
In machine learning, a model tries to make predictions. It measures how wrong it is using a loss function. The optimizer changes the model's settings (parameters) to reduce this loss. Think of it as a helper that nudges the model to improve.
Result
The model's parameters start changing to reduce errors.
Understanding that optimizers guide the model's learning process is key to grasping how training works.
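The loop described above can be sketched in PyTorch with a hypothetical one-parameter model (the data and target values here are made up for illustration):

```python
import torch

# Hypothetical one-parameter "model": predict y = w * x (true w is 3).
w = torch.tensor([0.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

x = torch.tensor([2.0])
y_true = torch.tensor([6.0])

for _ in range(100):
    optimizer.zero_grad()                  # clear gradients from the last step
    loss = ((w * x - y_true) ** 2).mean()  # measure how wrong the model is
    loss.backward()                        # compute gradient of loss w.r.t. w
    optimizer.step()                       # optimizer nudges w to reduce loss

final_w = w.item()  # ends up close to 3
```

The four-call pattern (zero_grad, loss, backward, step) is the same no matter which optimizer is plugged in.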
2
Foundation - Gradient descent basics
🤔
Concept: Explains how gradients show the direction to improve the model.
Gradient descent is a method where the model looks at the slope (gradient) of the loss to know which way to change parameters. It moves parameters opposite to the slope to reduce loss step-by-step.
Result
Parameters move closer to values that reduce error.
Knowing gradients point the way to better parameters helps understand why optimizers use them.
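A minimal, framework-free sketch of this idea: gradient descent on the made-up loss f(x) = (x - 5)², whose gradient is 2(x - 5):

```python
# Gradient descent by hand on f(x) = (x - 5)**2; its gradient is 2 * (x - 5).
x = 0.0
lr = 0.1  # step size

for _ in range(200):
    grad = 2 * (x - 5)  # slope of the loss at the current x
    x -= lr * grad      # move opposite the slope

# x has moved from 0 to (nearly) the minimum at 5
```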
3
Intermediate - Stochastic Gradient Descent (SGD)
🤔 Before reading on: Do you think SGD uses all data at once or small parts to update parameters? Commit to your answer.
Concept: Introduces SGD as a version of gradient descent using small data batches.
SGD updates model parameters using gradients from small random samples (batches) of data instead of the whole dataset. This makes updates faster and adds some randomness that can help escape bad solutions.
Result
Model learns faster with noisy but frequent updates.
Understanding SGD's use of small batches explains why training can be faster and sometimes more effective.
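In PyTorch, mini-batch SGD is typically torch.optim.SGD paired with a DataLoader; a small sketch on a made-up linear dataset (sizes and rates are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # make the batch shuffling reproducible

# Made-up regression data: 64 points on the line y = 2x.
X = torch.linspace(-1, 1, 64).unsqueeze(1)
Y = 2 * X
loader = DataLoader(TensorDataset(X, Y), batch_size=8, shuffle=True)

model = torch.nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

for epoch in range(50):
    for xb, yb in loader:             # one update per mini-batch, not per epoch
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()

learned_slope = model.weight.item()   # close to the true slope of 2
```

With batch_size=8 over 64 points, each epoch performs 8 noisy updates instead of one exact full-dataset update.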
4
Intermediate - Adam optimizer basics
🤔 Before reading on: Do you think Adam treats all gradients equally or remembers past gradients? Commit to your answer.
Concept: Explains Adam optimizer's use of past gradient information to improve updates.
Adam keeps track of past gradients and their squares to adjust the step size for each parameter individually. This helps the model learn faster and more reliably, especially on complex problems.
Result
Parameters update adaptively, often leading to quicker convergence.
Knowing Adam adapts learning rates per parameter clarifies why it often outperforms simpler methods.
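Switching to Adam is a one-line change in PyTorch. A one-parameter toy (predict y = w·x with true w = 3; the values are made up), now optimized with torch.optim.Adam:

```python
import torch

# Hypothetical one-parameter model: predict y = w * x (true w is 3).
w = torch.tensor([0.0], requires_grad=True)
optimizer = torch.optim.Adam([w], lr=0.1)  # adaptive per-parameter step sizes

x = torch.tensor([2.0])
y_true = torch.tensor([6.0])

for _ in range(1000):
    optimizer.zero_grad()
    loss = ((w * x - y_true) ** 2).mean()
    loss.backward()
    optimizer.step()

final_w = w.item()  # settles near 3
```

Internally, Adam is maintaining moving averages of gradients for w; the training loop itself is unchanged.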
5
Intermediate - Learning rate and its impact
🤔 Before reading on: Is a higher learning rate always better for faster learning? Commit to your answer.
Concept: Discusses the importance of the learning rate in optimizer performance.
The learning rate controls how big each parameter update is. Too big can cause the model to jump around and not learn well; too small makes learning slow. Both SGD and Adam need a good learning rate to work well.
Result
Proper learning rate leads to stable and efficient training.
Understanding learning rate effects helps avoid common training failures.
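A tiny framework-free experiment makes the effect concrete: gradient descent on f(x) = x² (gradient 2x) with a safe versus an oversized learning rate:

```python
# Gradient descent on f(x) = x**2 with two learning rates.
def final_distance(lr, steps=20):
    x = 1.0
    for _ in range(steps):
        x -= lr * 2 * x  # each step multiplies x by (1 - 2 * lr)
    return abs(x)

good = final_distance(0.1)     # |x| shrinks by 0.8 each step -> near zero
too_big = final_distance(1.1)  # |x| grows by 1.2 each step -> diverges
```

With lr = 0.1 the iterate contracts toward the minimum; with lr = 1.1 each update overshoots so badly that the error grows exponentially.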
6
Advanced - Adam's bias correction explained
🤔 Before reading on: Do you think Adam's moment estimates start unbiased or biased? Commit to your answer.
Concept: Explains why Adam uses bias correction for its moving averages.
Adam calculates moving averages of gradients but these averages start biased towards zero at the beginning. To fix this, Adam applies bias correction to get accurate estimates, improving early training steps.
Result
More reliable parameter updates especially at training start.
Knowing about bias correction reveals why Adam is stable and effective from the first step.
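The bias and its correction are visible in a single update (the gradient value is made up for illustration):

```python
# First Adam update of the gradient moving average, with beta1 = 0.9.
beta1 = 0.9
grad = 1.0  # suppose the true gradient is steadily 1.0

m = 0.0                             # moving average starts at zero
m = beta1 * m + (1 - beta1) * grad  # after step 1: m = 0.1, far below 1.0
m_hat = m / (1 - beta1 ** 1)        # bias-corrected: m_hat = 1.0
```

Because m starts at zero, the raw average reports only 10% of the true gradient on step one; dividing by (1 - beta1^t) rescales it to the correct value, and the correction factor fades to 1 as t grows.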
7
Expert - When Adam can fail and SGD shines
🤔 Before reading on: Do you think Adam always outperforms SGD? Commit to your answer.
Concept: Discusses scenarios where SGD may outperform Adam despite Adam's advantages.
Though Adam adapts quickly, it can sometimes lead to worse final results or overfitting. SGD with momentum often generalizes better on some tasks, such as image recognition. Experts sometimes switch from Adam to SGD partway through training.
Result
Choosing the right optimizer or switching can improve final model quality.
Understanding optimizer strengths and weaknesses guides better training strategies.
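Switching optimizers mid-training is just rebuilding the optimizer object; a sketch (the epoch counts and learning rates are placeholders):

```python
import torch

model = torch.nn.Linear(4, 1)

# Phase 1: Adam for fast early progress.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# ... train for some epochs with Adam ...

# Phase 2: rebuild the optimizer as SGD with momentum for fine-tuning.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
# ... continue training; note Adam's moment buffers are discarded here ...
```

One design consequence: the new SGD optimizer starts with empty state, so any momentum built up under Adam is lost at the switch.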
Under the Hood
Optimizers work by calculating gradients of the loss with respect to each parameter, then adjusting parameters to reduce loss. SGD uses the current gradient from a mini-batch to update parameters directly. Adam maintains moving averages of past gradients (first moment) and squared gradients (second moment) to adaptively scale updates per parameter. It also applies bias correction to these averages to avoid initial bias. These calculations happen at each training step, enabling the model to gradually improve.
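As a sanity check on this description, a from-scratch Adam update can be compared against torch.optim.Adam on a one-parameter quadratic (the loss f(w) = (w - 3)², with gradient 2(w - 3), is made up for the test):

```python
import math
import torch

lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

# Manual implementation of the update rule described above.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 11):
    g = 2 * (w - 3)
    m = beta1 * m + (1 - beta1) * g      # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * g * g  # second moment (mean of squared grads)
    m_hat = m / (1 - beta1 ** t)         # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)

# The same ten steps via PyTorch's optimizer.
wt = torch.tensor([0.0], requires_grad=True)
opt = torch.optim.Adam([wt], lr=lr, betas=(beta1, beta2), eps=eps)
for _ in range(10):
    opt.zero_grad()
    loss = ((wt - 3) ** 2).sum()
    loss.backward()
    opt.step()

# w and wt.item() agree to floating-point precision
```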
Why designed this way?
SGD was designed for efficiency by using small data batches to speed up training and introduce helpful noise. Adam was created to combine the benefits of momentum and adaptive learning rates, addressing SGD's sensitivity to learning rate tuning. Alternatives like RMSProp or Adagrad existed but Adam's bias correction and combined moments made it more robust and widely adopted.
┌───────────────┐
│   Data Batch  │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│  Compute Loss │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ Compute Grads │
└──────┬────────┘
       │
       ▼
┌───────────────┐          ┌───────────────┐
│   SGD Update  │─────────▶│ Update Params │
└───────────────┘          └───────────────┘

    -- or --

┌───────────────┐          ┌───────────────┐
│  Adam Update  │─────────▶│ Update Params │
│  (moments +   │          └───────────────┘
│ bias correct) │
└───────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does a higher learning rate always speed up training without issues? Commit to yes or no.
Common Belief: A higher learning rate always makes the model learn faster and better.
Reality: Too high a learning rate can cause the model to overshoot minima, making training unstable or causing it to fail.
Why it matters: Too high a learning rate wastes time and compute and can prevent the model from learning at all.
Quick: Does Adam always produce better final models than SGD? Commit to yes or no.
Common Belief: Adam is always better than SGD because it adapts learning rates automatically.
Reality: Adam often learns faster, but SGD with momentum can achieve better final accuracy and generalization on some tasks.
Why it matters: Blindly choosing Adam may lead to suboptimal models; knowing when to use SGD improves results.
Quick: Does SGD use the entire dataset to compute gradients every step? Commit to yes or no.
Common Belief: SGD calculates gradients using the whole dataset each time it updates parameters.
Reality: SGD uses small random batches (mini-batches), not the full dataset, to compute gradients each step.
Why it matters: Expecting full-dataset gradients gives the wrong intuition about training speed, memory use, and the noise visible in loss curves.
Quick: Does Adam's moving averages start unbiased from the first step? Commit to yes or no.
Common Belief: Adam's moving averages are accurate from the very first update.
Reality: They start biased towards zero and require bias correction for accurate updates early in training.
Why it matters: Ignoring bias correction can cause poor early-training behavior and slower convergence.
Expert Zone
1
Adam's adaptive learning rates can sometimes cause it to converge to sharp minima, which may generalize worse than SGD's flatter minima.
2
SGD with momentum can be seen as a low-pass filter smoothing gradients, which helps escape noisy updates better than plain SGD.
3
The choice of optimizer interacts with batch size and learning rate schedules, affecting training dynamics in subtle ways.
When NOT to use
Adam may not be ideal for very large datasets or when final model generalization is critical; in such cases, SGD with momentum or newer optimizers like AdamW or Ranger might be better. SGD is less suitable for sparse data or noisy gradients where adaptive methods excel.
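Where decoupled weight decay is wanted, AdamW is available directly in PyTorch; a one-liner sketch (the lr and weight_decay values are illustrative):

```python
import torch

model = torch.nn.Linear(16, 4)

# AdamW decouples weight decay from the adaptive update, unlike the
# L2-style penalty that plain Adam applies through the gradient.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```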
Production Patterns
In practice, many teams start training with Adam for fast convergence, then switch to SGD with momentum for fine-tuning. Learning rate warm-up and decay schedules are combined with optimizers to stabilize training. Custom optimizers or hybrid approaches are used for specialized tasks like NLP or reinforcement learning.
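A sketch of warm-up plus decay using PyTorch's built-in schedulers (the step counts and rates are placeholders, and gradient computation is omitted to keep the sketch short):

```python
import torch

model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Linear warm-up over the first 5 steps, then cosine decay for the rest.
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.1, total_iters=5
)
decay = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=95)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, decay], milestones=[5]
)

lrs = []
for _ in range(100):
    optimizer.step()      # (backward pass omitted in this sketch)
    scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])

# lrs rises during warm-up, peaks near 0.1, then decays toward zero
```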
Connections
Control Systems
Both use feedback loops to adjust system behavior toward a goal.
Understanding optimizers as feedback controllers helps grasp how they correct model errors iteratively.
Economics - Gradient-based Optimization in Resource Allocation
Both optimize a function by adjusting variables to maximize or minimize outcomes.
Seeing optimizers as economic decision-makers clarifies their role in balancing trade-offs during learning.
Human Learning and Skill Improvement
Both involve iterative adjustments based on feedback to improve performance.
Recognizing that optimizers mimic how humans learn from mistakes deepens intuition about training dynamics.
Common Pitfalls
#1 Using a fixed high learning rate, causing training to diverge.
Wrong approach:
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
for epoch in range(10):
    optimizer.zero_grad()
    loss = compute_loss()
    loss.backward()
    optimizer.step()
Correct approach:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for epoch in range(10):
    optimizer.zero_grad()
    loss = compute_loss()
    loss.backward()
    optimizer.step()
Root cause: Not understanding that too large a learning rate causes unstable updates and prevents convergence.
#2 Confusing SGD with full-batch gradient descent and expecting slow updates.
Wrong approach: using full-dataset gradients every step:
for epoch in range(10):
    optimizer.zero_grad()
    loss = compute_loss(full_dataset)
    loss.backward()
    optimizer.step()
Correct approach: using mini-batches for SGD:
for epoch in range(10):
    for batch in data_loader:
        optimizer.zero_grad()
        loss = compute_loss(batch)
        loss.backward()
        optimizer.step()
Root cause: Not realizing SGD uses mini-batches, which speeds up training and adds helpful noise.
#3 Ignoring Adam's bias correction, leading to poor early training.
Wrong approach: implementing Adam without bias correction:
# moving averages updated but no bias fix applied
m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad**2
param -= lr * m / (sqrt(v) + eps)
Correct approach: implementing Adam with bias correction:
m_hat = m / (1 - beta1**t)
v_hat = v / (1 - beta2**t)
param -= lr * m_hat / (sqrt(v_hat) + eps)
Root cause: Overlooking the need to correct the initial bias in the moving averages causes inaccurate updates early on.
Key Takeaways
Optimizers guide machine learning models to improve by adjusting parameters to reduce errors step-by-step.
SGD updates parameters using gradients from small random batches, making training faster and introducing helpful noise.
Adam improves on SGD by adapting learning rates per parameter using past gradient information and bias correction.
Choosing the right optimizer and tuning learning rates critically affects training speed and final model quality.
Understanding optimizer mechanics and limitations enables better training strategies and improved model performance.