PyTorch · ~15 mins

Mixed precision training (AMP) in PyTorch - Deep Dive

Overview - Mixed precision training (AMP)
What is it?
Mixed precision training is a technique that uses both 16-bit and 32-bit numbers to train deep learning models. It speeds up training and reduces memory use by doing most calculations in 16-bit, but keeps some important parts in 32-bit to stay accurate. Automatic Mixed Precision (AMP) is a tool that helps do this automatically without changing much code. It makes training faster and cheaper while keeping model quality high.
Why it matters
Training deep learning models can be very slow and use a lot of computer memory, which costs time and money. Without mixed precision, training large models might be impossible on some hardware. Mixed precision training solves this by making training faster and less memory hungry, so researchers and engineers can build better AI models more efficiently. Without it, progress in AI would be slower and more expensive.
Where it fits
Before learning mixed precision training, you should understand basic deep learning training loops, floating point numbers, and PyTorch tensors. After mastering mixed precision, you can explore advanced optimization techniques, distributed training, and hardware-specific performance tuning.
Mental Model
Core Idea
Mixed precision training speeds up deep learning by using faster, smaller numbers where possible, while keeping accuracy with full precision where needed.
Think of it like...
It's like writing a letter with a pencil for most words to write quickly, but using a pen for important parts to make sure they don't smudge or fade.
┌───────────────────────────────┐
│        Mixed Precision        │
├───────────────┬───────────────┤
│ 16-bit (FP16) │ 32-bit (FP32) │
├───────────────┼───────────────┤
│ Fast math     │ Accurate math │
│ Less memory   │ Stable updates│
└───────────────┴───────────────┘
                ↓
┌───────────────────────────────┐
│   Automatic Mixed Precision   │
│  (manages when to use each)   │
└───────────────────────────────┘
                ↓
┌───────────────────────────────┐
│ Faster training, less memory  │
│ Same model quality            │
└───────────────────────────────┘
Build-Up - 7 Steps
1
Foundation · Understanding Floating Point Numbers
🤔
Concept: Learn what floating point numbers are and why different precisions exist.
Computers store numbers in a format called floating point. The two common types are 32-bit (FP32) and 16-bit (FP16). FP32 uses more bits, so it can represent numbers more precisely and over a wider range. FP16 uses fewer bits, so it is faster and uses less memory but can lose some detail.
Result
You understand that FP16 is faster but less precise than FP32.
Knowing the difference between FP16 and FP32 helps you see why mixing them can speed up training without losing too much accuracy.
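The gap described above is easy to see directly. Here is a quick NumPy sketch (NumPy's float16 is the same IEEE half-precision format PyTorch's torch.float16 uses):

```python
import numpy as np

# FP32 keeps ~7 decimal digits; FP16 only ~3, so fine detail is lost.
x32 = np.float32(3.14159265)
x16 = np.float16(3.14159265)
print(x32)         # 3.1415927
print(float(x16))  # 3.140625

# FP16 also has a far smaller range: 65504 is its largest finite value.
print(float(np.finfo(np.float16).max))  # 65504.0
print(np.float16(70000.0))              # inf (overflow)

# And values below ~6e-8 underflow to exactly zero.
print(np.float16(1e-8))                 # 0.0
```

The overflow and underflow cases at the end are exactly the failure modes mixed precision has to work around.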
2
Foundation · Basics of Deep Learning Training
🤔
Concept: Understand how models learn by adjusting weights using gradients and loss.
Training a neural network means changing its weights to reduce errors. This uses a process called backpropagation, which calculates gradients (how much to change each weight). These calculations usually use FP32 for accuracy.
Result
You see that training needs many precise calculations to update model weights correctly.
Recognizing that training relies on precise math explains why simply switching to FP16 everywhere can cause problems.
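The update step described here can be shown as one manual gradient-descent step in PyTorch, done entirely in FP32 (the weight, data, and learning rate are made-up toy values):

```python
import torch

# One manual gradient-descent step on a single weight (FP32 throughout).
w = torch.tensor(2.0, requires_grad=True)
x, y = torch.tensor(3.0), torch.tensor(9.0)

loss = (w * x - y) ** 2   # squared prediction error
loss.backward()           # backpropagation: fills w.grad with dloss/dw
with torch.no_grad():
    w -= 0.01 * w.grad    # adjust the weight against the gradient

print(w.item())  # 2.18: moved toward 3.0, the value that makes loss zero
```

Note how small the update (0.18) already is here; real gradients late in training are far smaller, which is where FP16's limited range starts to matter.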
3
Intermediate · Why Use Mixed Precision Training
🤔 Before reading on: Do you think using only FP16 will always speed up training without any downsides? Commit to your answer.
Concept: Learn the benefits and challenges of using FP16 and FP32 together during training.
Using only FP16 can speed up training and save memory, but it can cause errors because FP16 can't represent very small or very large numbers well. Mixed precision training uses FP16 for most math but keeps FP32 for critical parts like weight updates to avoid errors.
Result
You understand that mixed precision balances speed and accuracy by combining FP16 and FP32.
Knowing the tradeoff between speed and precision helps you appreciate why mixed precision is a smart compromise.
4
Intermediate · How Automatic Mixed Precision (AMP) Works
🤔 Before reading on: Do you think AMP requires rewriting your entire training code? Commit to your answer.
Concept: AMP automatically chooses which operations use FP16 or FP32 during training.
AMP is a PyTorch feature that wraps your training code. It runs most operations in FP16 for speed but keeps numerically sensitive operations, such as reductions and loss computations, in FP32; a companion gradient scaler handles loss scaling so weight updates stay stable. This automation means you don't have to hand-cast tensors throughout your code to use mixed precision.
Result
You see that AMP makes mixed precision easy and safe to use.
Understanding AMP's automation reduces the barrier to adopting mixed precision in real projects.
5
Intermediate · Implementing AMP in PyTorch
🤔
Concept: Learn the simple code changes needed to enable AMP in PyTorch training loops.
In PyTorch, you import torch.cuda.amp and wrap forward passes in autocast() so eligible operations run in mixed precision. You also use GradScaler() to scale the loss so that small gradients don't underflow to zero before the optimizer step. This requires only a few lines added to your existing training loop.
Result
You can modify a standard training loop to use AMP and see faster training with less memory use.
Knowing the minimal code changes needed makes AMP practical for everyday use.
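Putting the pieces together, here is a minimal runnable sketch of an AMP training loop. The linear model and random data are hypothetical placeholders, and the scaler/autocast are disabled automatically when no GPU is present, so the same loop also runs on CPU:

```python
import torch
import torch.nn as nn

# Toy model and data (placeholders, just to make the loop runnable).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
inputs = torch.randn(8, 16, device=device)
targets = torch.randn(8, 1, device=device)

# GradScaler manages loss scaling; enabled=False turns it into a no-op,
# so this script still works on CPU-only machines.
use_amp = (device == "cuda")
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

for step in range(3):
    optimizer.zero_grad()
    # autocast runs eligible ops in FP16 on CUDA; when disabled it is
    # a plain FP32 forward pass.
    with torch.cuda.amp.autocast(enabled=use_amp):
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales grads, skips step on inf/NaN
    scaler.update()                # adjusts the scale factor dynamically
```

Only the scaler and the autocast context are new compared to a plain FP32 loop; the model, optimizer, and loss function are untouched.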
6
Advanced · Loss Scaling to Prevent Underflow
🤔 Before reading on: Do you think gradients in FP16 can always represent very small values accurately? Commit to your answer.
Concept: Learn why scaling the loss helps keep gradients in a safe range during FP16 training.
FP16 has a smaller range than FP32, so very small gradient values can become zero (underflow). Loss scaling multiplies the loss by a big number before backpropagation, making gradients larger and safe to represent. After gradients are computed, they are scaled back down before updating weights.
Result
You understand how loss scaling prevents training from breaking due to tiny gradients.
Loss scaling is key to stable mixed precision training, which is why AMP includes it automatically.
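The effect is easy to reproduce outside PyTorch. In this schematic NumPy illustration, a tiny "gradient" underflows to zero in FP16 unless the loss is scaled first (AMP's GradScaler performs the scale and unscale steps for you):

```python
import numpy as np

# A tiny late-training gradient value.
grad = 1e-8
print(np.float16(grad))  # 0.0 -> underflow: the update is silently lost

# Scale the loss by 2**16 before backward; gradients grow by the same
# factor (gradients are linear in the loss), back inside FP16's range.
scale = 2.0 ** 16
scaled_grad = np.float16(grad * scale)
print(scaled_grad > 0)   # True: representable again

# Unscale in FP32 before the optimizer step to recover the true gradient.
recovered = np.float32(scaled_grad) / scale
print(recovered)  # ~1e-8
```

The unscale step must happen in FP32; dividing in FP16 would just underflow again.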
7
Expert · AMP Internals and Performance Tradeoffs
🤔 Before reading on: Do you think AMP always improves training speed regardless of hardware? Commit to your answer.
Concept: Explore how AMP decides precision per operation and hardware factors affecting speed gains.
AMP uses a whitelist and blacklist of operations to decide which run in FP16 or FP32. Some ops are unsafe in FP16 and always run in FP32. The speedup depends on GPU architecture; newer GPUs with Tensor Cores benefit more. Also, memory bandwidth and model size affect gains. AMP balances precision and speed dynamically.
Result
You see that AMP's effectiveness depends on hardware and model details, not just code changes.
Understanding AMP internals helps optimize training setups and avoid surprises in performance.
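These per-op decisions can be observed directly even without a GPU: CPU autocast applies the same kind of policy using bfloat16 (a small sketch; exact op coverage can vary across PyTorch versions):

```python
import torch

a = torch.randn(4, 4)  # FP32 inputs
b = torch.randn(4, 4)

# matmul is on autocast's lower-precision list, so its output dtype
# changes inside the context even though the inputs are FP32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    c_inside = a @ b
c_outside = a @ b  # the same op outside autocast stays FP32

print(c_inside.dtype)   # torch.bfloat16
print(c_outside.dtype)  # torch.float32
```

Ops on the FP32-only list (and anything not listed) keep full precision inside the same context; that asymmetry is the "whitelist/blacklist" mechanism in action.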
Under the Hood
Mixed precision training works by running most tensor operations in 16-bit floating point (FP16) to speed up computation and reduce memory use. However, some operations like weight updates and loss calculations remain in 32-bit (FP32) to maintain numerical stability. AMP automates this by wrapping operations and managing when to cast tensors between FP16 and FP32. It also uses loss scaling to prevent small gradient values from becoming zero due to FP16's limited range.
Why designed this way?
Mixed precision was designed to leverage modern GPUs' hardware capabilities, especially Tensor Cores optimized for FP16 math. Early attempts to use only FP16 failed due to numerical instability. AMP was created to automate the complex decision of which operations can safely use FP16 and which need FP32, reducing developer effort and errors. This design balances speed, memory savings, and model accuracy.
┌────────────────────────────────┐
│      Training Loop Start       │
└───────────────┬────────────────┘
                │
        ┌───────▼────────┐
        │ Forward Pass   │
        │ (autocast FP16)│
        └───────┬────────┘
                │
        ┌───────▼────────┐
        │ Loss Compute   │
        │ (mostly FP32)  │
        └───────┬────────┘
                │
        ┌───────▼────────┐
        │ Loss Scaling   │
        │ (scale up)     │
        └───────┬────────┘
                │
        ┌───────▼────────┐
        │ Backward Pass  │
        │ (FP16 grads)   │
        └───────┬────────┘
                │
        ┌───────▼────────┐
        │ Unscale Grads  │
        │ (scale down)   │
        └───────┬────────┘
                │
        ┌───────▼────────┐
        │ Optimizer Step │
        │ (FP32 weights) │
        └───────┬────────┘
                │
        ┌───────▼────────┐
        │ Repeat Loop    │
        └────────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does using FP16 everywhere always make training faster and better? Commit yes or no.
Common Belief: Using FP16 for all calculations will always speed up training without any problems.
Reality: Using FP16 everywhere can cause numerical errors like underflow or overflow, leading to unstable training or poor model quality.
Why it matters: Ignoring this can cause training to fail silently or produce bad models, wasting time and resources.
Quick: Do you think AMP requires rewriting your entire training code? Commit yes or no.
Common Belief: AMP needs major code changes and manual casting everywhere to work.
Reality: AMP automates casting and loss scaling, requiring only small code additions around the training loop.
Why it matters: Believing otherwise may discourage people from using AMP and missing out on its benefits.
Quick: Is mixed precision training only useful on the newest GPUs? Commit yes or no.
Common Belief: Mixed precision training only works or is beneficial on the latest GPU hardware.
Reality: While newer GPUs with Tensor Cores get the most speedup, AMP can still improve memory use and sometimes speed on older GPUs.
Why it matters: Thinking it's useless on older hardware may prevent wider adoption and efficiency gains.
Quick: Does loss scaling only make training slower? Commit yes or no.
Common Belief: Loss scaling is just extra overhead that slows down training.
Reality: Loss scaling is essential to keep gradients in a safe range and prevent training failure; its overhead is minimal compared to the benefits.
Why it matters: Misunderstanding this can lead to disabling loss scaling and unstable training.
Expert Zone
1
AMP uses operation-level precision decisions based on a whitelist/blacklist, which can be customized for specific models or hardware.
2
Dynamic loss scaling adjusts the scale factor during training to maximize precision without causing overflow, improving stability automatically.
3
Some layers or custom operations may not be AMP-compatible and require manual intervention or custom autocast contexts.
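The dynamic loss scaling in point 2 can be sketched in plain Python. This mirrors the spirit of GradScaler's backoff/growth behavior; the default constants match GradScaler's documented defaults, but the class itself is an illustration, not PyTorch's implementation:

```python
# Schematic of dynamic loss scaling (the idea behind GradScaler.update()).
class DynamicScaler:
    def __init__(self, init_scale=2.0**16, growth=2.0, backoff=0.5,
                 growth_interval=2000):
        self.scale = init_scale
        self.growth = growth
        self.backoff = backoff
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_inf: bool):
        if found_inf:
            # Overflow: the scale was too aggressive, back off immediately.
            self.scale *= self.backoff
            self._good_steps = 0
        else:
            # Clean step: after enough in a row, try a larger scale for
            # better small-gradient resolution.
            self._good_steps += 1
            if self._good_steps >= self.growth_interval:
                self.scale *= self.growth
                self._good_steps = 0

# Toy run with a short growth interval so the behavior is visible.
scaler = DynamicScaler(init_scale=8.0, growth_interval=3)
history = []
for found_inf in [False, False, True, False, False, False]:
    scaler.update(found_inf)
    history.append(scaler.scale)
print(history)  # [8.0, 8.0, 4.0, 4.0, 4.0, 8.0]
```

The scale halves the moment an overflow is detected and only doubles after a full run of clean steps, so it hovers just below the overflow threshold.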
When NOT to use
Mixed precision training is not ideal when training very small models where overhead outweighs benefits, or on hardware without FP16 support. For extremely sensitive numerical tasks, full FP32 or even higher precision may be necessary. Alternatives include manual mixed precision or using bfloat16 on supported hardware.
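The bfloat16 alternative mentioned above trades significand bits for FP32's full exponent range, which is why it usually needs no loss scaling. torch.finfo makes the difference concrete:

```python
import torch

# Compare range (max, smallest normal) across the three formats.
for dtype in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dtype)
    print(f"{str(dtype):>15}: max={info.max:.3e}, "
          f"smallest normal={info.tiny:.3e}")

# bfloat16 shares float32's 8-bit exponent (max ~3.4e38, same smallest
# normal), so gradients rarely underflow and loss scaling is usually
# unnecessary -- at the cost of an 8-bit significand vs FP16's 11 bits.
```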
Production Patterns
In production, AMP is often combined with distributed training and gradient checkpointing to maximize speed and memory efficiency. Engineers monitor training stability closely and may customize AMP behavior for custom layers. AMP is standard in many state-of-the-art model training pipelines to reduce costs and accelerate iteration.
Connections
Floating Point Arithmetic
Mixed precision training builds directly on floating point number formats and their precision limits.
Understanding floating point arithmetic helps grasp why some operations need higher precision and why loss scaling is necessary.
Hardware Acceleration (GPU Tensor Cores)
Mixed precision training leverages specialized hardware units designed for FP16 math to speed up computation.
Knowing how GPUs accelerate FP16 operations explains the performance gains and hardware dependencies of AMP.
Numerical Stability in Scientific Computing
Mixed precision training addresses numerical stability challenges common in scientific calculations with limited precision.
Recognizing parallels with numerical stability techniques in other fields helps appreciate the design of loss scaling and precision management.
Common Pitfalls
#1 Training with FP16 everywhere without loss scaling causes gradients to become zero.
Wrong approach:
with torch.cuda.amp.autocast():
    output = model(input)
    loss = loss_fn(output, target)
loss.backward()
optimizer.step()
Correct approach:
scaler = torch.cuda.amp.GradScaler()
with torch.cuda.amp.autocast():
    output = model(input)
    loss = loss_fn(output, target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
Root cause: Not scaling the loss means small gradients underflow in FP16, becoming zero and stopping learning.
#2 Manually casting all tensors to FP16 without AMP causes instability and errors.
Wrong approach:
input = input.half()
model = model.half()
output = model(input)
loss = loss_fn(output, target)
loss.backward()
optimizer.step()
Correct approach: Use AMP's autocast and GradScaler instead of manual casting to handle precision safely.
Root cause: Manual casting misses critical FP32 operations and loss scaling, causing numerical problems.
#3 Assuming AMP will always speed up training regardless of GPU type.
Wrong approach:
# Using AMP on a very old GPU, expecting a big speedup
with torch.cuda.amp.autocast():
    output = model(input)
    loss = loss_fn(output, target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
Correct approach: Check GPU capabilities; on older GPUs, AMP may improve memory use but not speed significantly.
Root cause: Not understanding hardware limits leads to unrealistic expectations and confusion.
Key Takeaways
Mixed precision training uses both 16-bit and 32-bit numbers to speed up deep learning while keeping accuracy.
Automatic Mixed Precision (AMP) automates precision management and loss scaling, making mixed precision easy to use.
Loss scaling is essential to prevent small gradient values from disappearing in 16-bit precision.
AMP's benefits depend on hardware support, especially GPUs with Tensor Cores.
Understanding floating point limits and numerical stability is key to using mixed precision effectively.