DistributedDataParallel speeds up training by running the same model on many GPUs or machines at once. Each process works on a different slice of the data, and gradients are synchronized so the model stays consistent across devices.
DistributedDataParallel in PyTorch
Introduction
You want to train a large neural network faster by using multiple GPUs.
You have a big dataset that takes too long to train on one machine.
You want to scale your training across several computers in a cluster.
You want to keep your model synchronized across devices during training.
You need to reduce training time for deep learning projects.
Syntax
PyTorch
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize process group for communication
dist.init_process_group(backend='nccl')

# Create model and move it to GPU
model = MyModel().to(device)

# Wrap model with DistributedDataParallel
model = DDP(model, device_ids=[device_id])
You must initialize the process group before using DistributedDataParallel.
Each process should handle one GPU and wrap the model separately.
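The one-process-per-GPU pattern above can be sketched with torch.multiprocessing.spawn. This is a minimal, runnable illustration, not the only way to launch DDP (torchrun is the usual launcher in practice). It uses the gloo backend on CPU so it runs without GPUs; the names worker and WORLD_SIZE are illustrative, not part of the PyTorch API.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

WORLD_SIZE = 2  # one process per device

def worker(rank, world_size):
    # Rendezvous settings every process must agree on
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29502"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Each process wraps its own copy of the model.
    # On GPUs you would use .to(rank) and DDP(..., device_ids=[rank]).
    model = DDP(nn.Linear(10, 1))
    out = model(torch.randn(4, 10))
    print(f"rank {rank}: output shape {tuple(out.shape)}")

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(WORLD_SIZE,), nprocs=WORLD_SIZE, join=True)
```

Each spawned process receives its rank as the first argument, initializes the group, and runs its own copy of the training code.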
Examples
Wraps the model on GPU 0 for distributed training on a single machine with one GPU.
PyTorch
model = DDP(model, device_ids=[0])
Wraps the model on the GPU matching the process rank, useful in multi-GPU, multi-process setups.
PyTorch
model = DDP(model, device_ids=[rank], output_device=rank)
Initializes the process group using environment variables and the Gloo backend, often used for CPU or cross-machine training.
PyTorch
dist.init_process_group(backend='gloo', init_method='env://')
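With init_method='env://', torch.distributed reads its rendezvous settings from environment variables, which are normally set for you by a launcher such as torchrun. As a rough sketch, here is what those variables look like for a single-node, single-process run; the address and port values are illustrative.

```python
import os

# Variables read by init_method='env://' (usually set by the launcher)
os.environ["MASTER_ADDR"] = "localhost"  # address of the rank-0 host
os.environ["MASTER_PORT"] = "29503"      # a free port on that host
os.environ["RANK"] = "0"                 # this process's global rank
os.environ["WORLD_SIZE"] = "1"           # total number of processes

import torch.distributed as dist

dist.init_process_group(backend="gloo", init_method="env://")
print(f"rank {dist.get_rank()} of {dist.get_world_size()}")
dist.destroy_process_group()
```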
Sample Model
This code sets up a simple linear model wrapped with DistributedDataParallel on one GPU. It runs one training step and prints the loss.
PyTorch
import os
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Simple model definition
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 1)

    def forward(self, x):
        return self.linear(x)

# Set rendezvous environment variables for a single-node, single-GPU run
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'
dist.init_process_group(backend='nccl', rank=0, world_size=1)
device = torch.device('cuda:0')

# Create model and move to device
model = SimpleModel().to(device)

# Wrap model with DDP
model = DDP(model, device_ids=[0])

# Create dummy input and target
inputs = torch.randn(5, 10).to(device)
targets = torch.randn(5, 1).to(device)

# Loss and optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training step
model.train()
optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, targets)
loss.backward()
optimizer.step()

print(f"Loss after one step: {loss.item():.4f}")

# Clean up the process group
dist.destroy_process_group()
Important Notes
DistributedDataParallel requires one process per GPU for best performance.
Always call dist.init_process_group before creating the DDP model.
Use the correct backend ('nccl' for GPUs, 'gloo' for CPUs) depending on your hardware.
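Besides one process per GPU, each process must also see a distinct shard of the dataset, which torch.utils.data.DistributedSampler handles. The sketch below shows the sampler in a single-process gloo group so it runs on CPU; the dataset and sizes are made up for illustration.

```python
import os
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Minimal single-process group so the sampler can infer rank/world_size
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29504")
dist.init_process_group("gloo", rank=0, world_size=1)

# Dummy dataset of 100 samples
dataset = TensorDataset(torch.randn(100, 10), torch.randn(100, 1))

# Each rank gets roughly len(dataset) / world_size samples
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=10, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle differently each epoch
    for x, y in loader:
        pass  # forward/backward/step would go here

dist.destroy_process_group()
```

Without the sampler, every process would train on the same batches and the extra GPUs would add no new data per step.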
Summary
DistributedDataParallel speeds up training by using multiple GPUs or machines.
Initialize the process group and wrap your model with DDP for synchronized training.
Each process should handle one GPU and run its own training loop.