
DistributedDataParallel in PyTorch

Introduction

DistributedDataParallel (DDP) speeds up training by running one replica of the model in each of several processes, typically one process per GPU. Each process trains on a different slice of the data, and gradients are averaged across all replicas after every backward pass, so the model copies stay in sync.

You want to train a large neural network faster by using multiple GPUs.
You have a big dataset that takes too long to train on one machine.
You want to scale your training across several computers in a cluster.
You want to keep your model synchronized across devices during training.
You need to reduce training time for deep learning projects.
Syntax
PyTorch
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize process group for communication
dist.init_process_group(backend='nccl')

# Create model and move it to GPU
model = MyModel().to(device)

# Wrap model with DistributedDataParallel
model = DDP(model, device_ids=[device_id])

You must initialize the process group before using DistributedDataParallel.

Each process should handle one GPU and wrap the model separately.
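In practice every process runs the same script, with its rank supplied by a launcher such as torchrun. Below is a minimal single-process sketch of that per-process routine, using the Gloo backend on CPU so it runs without a GPU. The address and port values are arbitrary placeholders; with torchrun you would read the rank and world size from the environment instead of hard-coding them.

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# With a launcher like torchrun these come from the environment;
# hard-coded here for a single-process demonstration
rank, world_size = 0, 1
os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '29501')

# Every process calls this before wrapping its model
dist.init_process_group('gloo', rank=rank, world_size=world_size)

# On CPU, omit device_ids; with one GPU per process use device_ids=[rank]
model = DDP(nn.Linear(4, 2))

out = model(torch.randn(3, 4))
print(tuple(out.shape))  # (3, 2)

dist.destroy_process_group()
```

With multiple processes, each one would execute this same code with its own rank, and DDP would synchronize gradients between them automatically.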

Examples
Wraps the model on GPU 0 for distributed training on a single machine with one GPU.
PyTorch
model = DDP(model, device_ids=[0])
Wraps the model on GPU matching the process rank, useful in multi-GPU multi-process setups.
PyTorch
model = DDP(model, device_ids=[rank], output_device=rank)
Initializes the process group from environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) using the Gloo backend, which works on CPU-only machines.
PyTorch
dist.init_process_group(backend='gloo', init_method='env://')
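DDP averages gradients but does not shard your data; each process must draw a different slice of the dataset, which is what DistributedSampler is for. A minimal sketch follows. The rank and num_replicas are passed explicitly here so the sampler can be shown without a running process group; inside a real DDP script you would omit them and let the sampler read them from the group.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data.distributed import DistributedSampler

# Tiny dataset of 8 samples to make the sharding visible
dataset = TensorDataset(torch.arange(8, dtype=torch.float32).unsqueeze(1))

# Rank 0 of 2 sees indices 0, 2, 4, 6; rank 1 would see 1, 3, 5, 7
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=False)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

print(list(sampler))  # [0, 2, 4, 6]
```

When shuffle=True, call sampler.set_epoch(epoch) at the start of each epoch so the shuffle order differs between epochs.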
Sample Model

This code sets up a simple linear model wrapped with DistributedDataParallel on one GPU. It runs one training step and prints the loss.

PyTorch
import os
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Simple model definition
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 1)
    def forward(self, x):
        return self.linear(x)

# Single-process setup: provide a rendezvous address, then init the group
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'
dist.init_process_group(backend='nccl', rank=0, world_size=1)

device = torch.device('cuda:0')

# Create model and move to device
model = SimpleModel().to(device)

# Wrap model with DDP
model = DDP(model, device_ids=[0])

# Create dummy input and target
inputs = torch.randn(5, 10).to(device)
targets = torch.randn(5, 1).to(device)

# Loss and optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training step
model.train()
optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, targets)
loss.backward()
optimizer.step()

print(f"Loss after one step: {loss.item():.4f}")

# Clean up the process group when training is done
dist.destroy_process_group()
Important Notes

DistributedDataParallel requires one process per GPU for best performance.

Always call dist.init_process_group before creating the DDP model.

Use the correct backend ('nccl' for GPUs, 'gloo' for CPUs) depending on your hardware.
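A common pattern is to pick the backend from the hardware at hand. Here is a hedged sketch under simple assumptions (a single-process group and an arbitrary placeholder port, purely to show the calls):

```python
import os
import torch
import torch.distributed as dist

# 'nccl' requires CUDA; fall back to 'gloo' on CPU-only machines
backend = 'nccl' if torch.cuda.is_available() else 'gloo'

os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '29502')
dist.init_process_group(backend=backend, rank=0, world_size=1)

backend_name = dist.get_backend()
print(backend_name)  # 'nccl' on a GPU machine, 'gloo' otherwise

dist.destroy_process_group()
```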

Summary

DistributedDataParallel speeds up training by using multiple GPUs or machines.

Initialize the process group and wrap your model with DDP for synchronized training.

Each process should handle one GPU and run its own training loop.