PyTorch · ~15 mins

Mixed precision training (AMP) in PyTorch - Deep Dive

Overview - Mixed precision training (AMP)
What is it?
Mixed precision training is a technique that uses both 16-bit and 32-bit numbers to train deep learning models. It speeds up training and reduces memory use by doing most calculations in 16-bit, but keeps some important parts in 32-bit to stay accurate. Automatic Mixed Precision (AMP) is a tool that helps do this automatically without changing much code. It makes training faster and cheaper while keeping model quality high.
Why it matters
Training deep learning models can be very slow and use a lot of computer memory, which costs time and money. Without mixed precision, training large models might be impossible on some hardware. Mixed precision training solves this by making training faster and less memory hungry, so researchers and engineers can build better AI models more efficiently. Without it, progress in AI would be slower and more expensive.
Where it fits
Before learning mixed precision training, you should understand basic deep learning training loops, floating point numbers, and PyTorch tensors. After mastering mixed precision, you can explore advanced optimization techniques, distributed training, and hardware-specific performance tuning.
Mental Model
Core Idea
Mixed precision training speeds up deep learning by using faster, smaller numbers where possible, while keeping accuracy with full precision where needed.
Think of it like...
It's like writing a letter with a pencil for most words to write quickly, but using a pen for important parts to make sure they don't smudge or fade.
┌───────────────────────────────┐
│        Mixed Precision        │
├───────────────┬───────────────┤
│ 16-bit (FP16) │ 32-bit (FP32) │
├───────────────┼───────────────┤
│ Fast math     │ Accurate math │
│ Less memory   │ Stable updates│
└───────────────┴───────────────┘
                ↓
┌───────────────────────────────┐
│   Automatic Mixed Precision   │
│  (manages when to use each)   │
└───────────────────────────────┘
                ↓
┌───────────────────────────────┐
│ Faster training, less memory  │
│ Same model quality            │
└───────────────────────────────┘
Build-Up - 7 Steps
1
Foundation · Understanding Floating Point Numbers
🤔
Concept: Learn what floating point numbers are and why different precisions exist.
Computers store numbers in a format called floating point. The two common types are 32-bit (FP32) and 16-bit (FP16). FP32 uses more bits, so it can represent numbers more precisely and over a wider range. FP16 uses fewer bits, so it is faster and uses less memory but can lose some detail.
Result
You understand that FP16 is faster but less precise than FP32.
Knowing the difference between FP16 and FP32 helps you see why mixing them can speed up training without losing too much accuracy.
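The gap described above is easy to see directly. Here is a quick NumPy sketch (NumPy's float16 is the same IEEE half-precision format PyTorch's torch.float16 uses):

```python
import numpy as np

# FP32 keeps ~7 decimal digits; FP16 only ~3, so fine detail is lost.
x32 = np.float32(3.14159265)
x16 = np.float16(3.14159265)
print(x32)         # 3.1415927
print(float(x16))  # 3.140625

# FP16 also has a far smaller range: 65504 is its largest finite value.
print(float(np.finfo(np.float16).max))  # 65504.0
print(np.float16(70000.0))              # inf (overflow)

# And values below ~6e-8 underflow to exactly zero.
print(np.float16(1e-8))                 # 0.0
```

The overflow and underflow cases at the end are exactly the failure modes mixed precision has to work around.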
2
Foundation · Basics of Deep Learning Training
🤔
Concept: Understand how models learn by adjusting weights using gradients and loss.
Training a neural network means changing its weights to reduce errors. This uses a process called backpropagation, which calculates gradients (how much to change each weight). These calculations usually use FP32 for accuracy.
Result
You see that training needs many precise calculations to update model weights correctly.
Recognizing that training relies on precise math explains why simply switching to FP16 everywhere can cause problems.
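The update step described here can be shown as one manual gradient-descent step in PyTorch, done entirely in FP32 (the weight, data, and learning rate are made-up toy values):

```python
import torch

# One manual gradient-descent step on a single weight (FP32 throughout).
w = torch.tensor(2.0, requires_grad=True)
x, y = torch.tensor(3.0), torch.tensor(9.0)

loss = (w * x - y) ** 2   # squared prediction error
loss.backward()           # backpropagation: fills w.grad with dloss/dw
with torch.no_grad():
    w -= 0.01 * w.grad    # adjust the weight against the gradient

print(w.item())  # 2.18: moved toward 3.0, the value that makes loss zero
```

Note how small the update (0.18) already is here; real gradients late in training are far smaller, which is where FP16's limited range starts to matter.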
3
Intermediate · Why Use Mixed Precision Training
🤔 Before reading on: Do you think using only FP16 will always speed up training without any downsides? Commit to your answer.
Concept: Learn the benefits and challenges of using FP16 and FP32 together during training.
Using only FP16 can speed up training and save memory, but it can cause errors because FP16 can't represent very small or very large numbers well. Mixed precision training uses FP16 for most math but keeps FP32 for critical parts like weight updates to avoid errors.
Result
You understand that mixed precision balances speed and accuracy by combining FP16 and FP32.
Knowing the tradeoff between speed and precision helps you appreciate why mixed precision is a smart compromise.
4
Intermediate · How Automatic Mixed Precision (AMP) Works
🤔 Before reading on: Do you think AMP requires rewriting your entire training code? Commit to your answer.
Concept: AMP automatically chooses which operations use FP16 or FP32 during training.
AMP is a PyTorch feature that wraps your training code. It runs most operations in FP16 for speed but keeps numerically sensitive operations, such as reductions and loss computations, in FP32; a companion gradient scaler handles loss scaling so weight updates stay stable. This automation means you don't have to hand-cast tensors throughout your code to use mixed precision.
Result
You see that AMP makes mixed precision easy and safe to use.
Understanding AMP's automation reduces the barrier to adopting mixed precision in real projects.
5
Intermediate · Implementing AMP in PyTorch
🤔
Concept: Learn the simple code changes needed to enable AMP in PyTorch training loops.
In PyTorch, you import torch.cuda.amp and wrap forward passes in autocast() so eligible operations run in mixed precision. You also use GradScaler() to scale the loss so that small gradients don't underflow to zero before the optimizer step. This requires only a few lines added to your existing training loop.
Result
You can modify a standard training loop to use AMP and see faster training with less memory use.
Knowing the minimal code changes needed makes AMP practical for everyday use.
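Putting the pieces together, here is a minimal runnable sketch of an AMP training loop. The linear model and random data are hypothetical placeholders, and the scaler/autocast are disabled automatically when no GPU is present, so the same loop also runs on CPU:

```python
import torch
import torch.nn as nn

# Toy model and data (placeholders, just to make the loop runnable).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
inputs = torch.randn(8, 16, device=device)
targets = torch.randn(8, 1, device=device)

# GradScaler manages loss scaling; enabled=False turns it into a no-op,
# so this script still works on CPU-only machines.
use_amp = (device == "cuda")
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

for step in range(3):
    optimizer.zero_grad()
    # autocast runs eligible ops in FP16 on CUDA; when disabled it is
    # a plain FP32 forward pass.
    with torch.cuda.amp.autocast(enabled=use_amp):
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales grads, skips step on inf/NaN
    scaler.update()                # adjusts the scale factor dynamically
```

Only the scaler and the autocast context are new compared to a plain FP32 loop; the model, optimizer, and loss function are untouched.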
6
Advanced · Loss Scaling to Prevent Underflow
🤔 Before reading on: Do you think gradients in FP16 can always represent very small values accurately? Commit to your answer.
Concept: Learn why scaling the loss helps keep gradients in a safe range during FP16 training.
FP16 has a smaller range than FP32, so very small gradient values can become zero (underflow). Loss scaling multiplies the loss by a big number before backpropagation, making gradients larger and safe to represent. After gradients are computed, they are scaled back down before updating weights.
Result
You understand how loss scaling prevents training from breaking due to tiny gradients.
Loss scaling is key to stable mixed precision training, which is why AMP includes it automatically.
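The effect is easy to reproduce outside PyTorch. In this schematic NumPy illustration, a tiny "gradient" underflows to zero in FP16 unless the loss is scaled first (AMP's GradScaler performs the scale and unscale steps for you):

```python
import numpy as np

# A tiny late-training gradient value.
grad = 1e-8
print(np.float16(grad))  # 0.0 -> underflow: the update is silently lost

# Scale the loss by 2**16 before backward; gradients grow by the same
# factor (gradients are linear in the loss), back inside FP16's range.
scale = 2.0 ** 16
scaled_grad = np.float16(grad * scale)
print(scaled_grad > 0)   # True: representable again

# Unscale in FP32 before the optimizer step to recover the true gradient.
recovered = np.float32(scaled_grad) / scale
print(recovered)  # ~1e-8
```

The unscale step must happen in FP32; dividing in FP16 would just underflow again.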
7
Expert · AMP Internals and Performance Tradeoffs
🤔 Before reading on: Do you think AMP always improves training speed regardless of hardware? Commit to your answer.
Concept: Explore how AMP decides precision per operation and hardware factors affecting speed gains.
AMP uses a whitelist and blacklist of operations to decide which run in FP16 or FP32. Some ops are unsafe in FP16 and always run in FP32. The speedup depends on GPU architecture; newer GPUs with Tensor Cores benefit more. Also, memory bandwidth and model size affect gains. AMP balances precision and speed dynamically.
Result
You see that AMP's effectiveness depends on hardware and model details, not just code changes.
Understanding AMP internals helps optimize training setups and avoid surprises in performance.
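These per-op decisions can be observed directly even without a GPU: CPU autocast applies the same kind of policy using bfloat16 (a small sketch; exact op coverage can vary across PyTorch versions):

```python
import torch

a = torch.randn(4, 4)  # FP32 inputs
b = torch.randn(4, 4)

# matmul is on autocast's lower-precision list, so its output dtype
# changes inside the context even though the inputs are FP32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    c_inside = a @ b
c_outside = a @ b  # the same op outside autocast stays FP32

print(c_inside.dtype)   # torch.bfloat16
print(c_outside.dtype)  # torch.float32
```

Ops on the FP32-only list (and anything not listed) keep full precision inside the same context; that asymmetry is the "whitelist/blacklist" mechanism in action.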
Under the Hood
Mixed precision training works by running most tensor operations in 16-bit floating point (FP16) to speed up computation and reduce memory use. However, some operations like weight updates and loss calculations remain in 32-bit (FP32) to maintain numerical stability. AMP automates this by wrapping operations and managing when to cast tensors between FP16 and FP32. It also uses loss scaling to prevent small gradient values from becoming zero due to FP16's limited range.
Why designed this way?
Mixed precision was designed to leverage modern GPUs' hardware capabilities, especially Tensor Cores optimized for FP16 math. Early attempts to use only FP16 failed due to numerical instability. AMP was created to automate the complex decision of which operations can safely use FP16 and which need FP32, reducing developer effort and errors. This design balances speed, memory savings, and model accuracy.
┌────────────────────────────────┐
│      Training Loop Start       │
└───────────────┬────────────────┘
                │
        ┌───────▼────────┐
        │ Forward Pass   │
        │ (autocast FP16)│
        └───────┬────────┘
                │
        ┌───────▼────────┐
        │ Loss Compute   │
        │ (mostly FP32)  │
        └───────┬────────┘
                │
        ┌───────▼────────┐
        │ Loss Scaling   │
        │ (scale up)     │
        └───────┬────────┘
                │
        ┌───────▼────────┐
        │ Backward Pass  │
        │ (FP16 grads)   │
        └───────┬────────┘
                │
        ┌───────▼────────┐
        │ Unscale Grads  │
        │ (scale down)   │
        └───────┬────────┘
                │
        ┌───────▼────────┐
        │ Optimizer Step │
        │ (FP32 weights) │
        └───────┬────────┘
                │
        ┌───────▼────────┐
        │ Repeat Loop    │
        └────────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does using FP16 everywhere always make training faster and better? Commit yes or no.
Common Belief: Using FP16 for all calculations will always speed up training without any problems.
Reality: Using FP16 everywhere can cause numerical errors like underflow or overflow, leading to unstable training or poor model quality.
Why it matters: Ignoring this can cause training to fail silently or produce bad models, wasting time and resources.
Quick: Do you think AMP requires rewriting your entire training code? Commit yes or no.
Common Belief: AMP needs major code changes and manual casting everywhere to work.
Reality: AMP automates casting and loss scaling, requiring only small code additions around the training loop.
Why it matters: Believing otherwise may discourage people from using AMP and missing out on its benefits.
Quick: Is mixed precision training only useful on the newest GPUs? Commit yes or no.
Common Belief: Mixed precision training only works or is beneficial on the latest GPU hardware.
Reality: While newer GPUs with Tensor Cores get the most speedup, AMP can still improve memory use and sometimes speed on older GPUs.
Why it matters: Thinking it's useless on older hardware may prevent wider adoption and efficiency gains.
Quick: Does loss scaling only make training slower? Commit yes or no.
Common Belief: Loss scaling is just extra overhead that slows down training.
Reality: Loss scaling is essential to keep gradients in a safe range and prevent training failure; its overhead is minimal compared to the benefits.
Why it matters: Misunderstanding this can lead to disabling loss scaling and unstable training.
Expert Zone
1
AMP uses operation-level precision decisions based on a whitelist/blacklist, which can be customized for specific models or hardware.
2
Dynamic loss scaling adjusts the scale factor during training to maximize precision without causing overflow, improving stability automatically.
3
Some layers or custom operations may not be AMP-compatible and require manual intervention or custom autocast contexts.
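The dynamic loss scaling in point 2 can be sketched in plain Python. This mirrors the spirit of GradScaler's backoff/growth behavior; the default constants match GradScaler's documented defaults, but the class itself is an illustration, not PyTorch's implementation:

```python
# Schematic of dynamic loss scaling (the idea behind GradScaler.update()).
class DynamicScaler:
    def __init__(self, init_scale=2.0**16, growth=2.0, backoff=0.5,
                 growth_interval=2000):
        self.scale = init_scale
        self.growth = growth
        self.backoff = backoff
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_inf: bool):
        if found_inf:
            # Overflow: the scale was too aggressive, back off immediately.
            self.scale *= self.backoff
            self._good_steps = 0
        else:
            # Clean step: after enough in a row, try a larger scale for
            # better small-gradient resolution.
            self._good_steps += 1
            if self._good_steps >= self.growth_interval:
                self.scale *= self.growth
                self._good_steps = 0

# Toy run with a short growth interval so the behavior is visible.
scaler = DynamicScaler(init_scale=8.0, growth_interval=3)
history = []
for found_inf in [False, False, True, False, False, False]:
    scaler.update(found_inf)
    history.append(scaler.scale)
print(history)  # [8.0, 8.0, 4.0, 4.0, 4.0, 8.0]
```

The scale halves the moment an overflow is detected and only doubles after a full run of clean steps, so it hovers just below the overflow threshold.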
When NOT to use
Mixed precision training is not ideal when training very small models where overhead outweighs benefits, or on hardware without FP16 support. For extremely sensitive numerical tasks, full FP32 or even higher precision may be necessary. Alternatives include manual mixed precision or using bfloat16 on supported hardware.
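The bfloat16 alternative mentioned above trades significand bits for FP32's full exponent range, which is why it usually needs no loss scaling. torch.finfo makes the difference concrete:

```python
import torch

# Compare range (max, smallest normal) across the three formats.
for dtype in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dtype)
    print(f"{str(dtype):>15}: max={info.max:.3e}, "
          f"smallest normal={info.tiny:.3e}")

# bfloat16 shares float32's 8-bit exponent (max ~3.4e38, same smallest
# normal), so gradients rarely underflow and loss scaling is usually
# unnecessary -- at the cost of an 8-bit significand vs FP16's 11 bits.
```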
Production Patterns
In production, AMP is often combined with distributed training and gradient checkpointing to maximize speed and memory efficiency. Engineers monitor training stability closely and may customize AMP behavior for custom layers. AMP is standard in many state-of-the-art model training pipelines to reduce costs and accelerate iteration.
Connections
Floating Point Arithmetic
Mixed precision training builds directly on floating point number formats and their precision limits.
Understanding floating point arithmetic helps grasp why some operations need higher precision and why loss scaling is necessary.
Hardware Acceleration (GPU Tensor Cores)
Mixed precision training leverages specialized hardware units designed for FP16 math to speed up computation.
Knowing how GPUs accelerate FP16 operations explains the performance gains and hardware dependencies of AMP.
Numerical Stability in Scientific Computing
Mixed precision training addresses numerical stability challenges common in scientific calculations with limited precision.
Recognizing parallels with numerical stability techniques in other fields helps appreciate the design of loss scaling and precision management.
Common Pitfalls
#1 Training with FP16 everywhere without loss scaling causes gradients to become zero.
Wrong approach:
with torch.cuda.amp.autocast():
    output = model(input)
    loss = loss_fn(output, target)
loss.backward()
optimizer.step()
Correct approach:
scaler = torch.cuda.amp.GradScaler()
with torch.cuda.amp.autocast():
    output = model(input)
    loss = loss_fn(output, target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
Root cause: Not scaling the loss means small gradients underflow in FP16, becoming zero and stopping learning.
#2 Manually casting all tensors to FP16 without AMP causes instability and errors.
Wrong approach:
input = input.half()
model = model.half()
output = model(input)
loss = loss_fn(output, target)
loss.backward()
optimizer.step()
Correct approach: Use AMP's autocast and GradScaler instead of manual casting to handle precision safely.
Root cause: Manual casting misses critical FP32 operations and loss scaling, causing numerical problems.
#3 Assuming AMP will always speed up training regardless of GPU type.
Wrong approach:
# Using AMP on a very old GPU, expecting a big speedup
with torch.cuda.amp.autocast():
    output = model(input)
    loss = loss_fn(output, target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
Correct approach: Check GPU capabilities; on older GPUs, AMP may improve memory use but not speed significantly.
Root cause: Not understanding hardware limits leads to unrealistic expectations and confusion.
Key Takeaways
Mixed precision training uses both 16-bit and 32-bit numbers to speed up deep learning while keeping accuracy.
Automatic Mixed Precision (AMP) automates precision management and loss scaling, making mixed precision easy to use.
Loss scaling is essential to prevent small gradient values from disappearing in 16-bit precision.
AMP's benefits depend on hardware support, especially GPUs with Tensor Cores.
Understanding floating point limits and numerical stability is key to using mixed precision effectively.