DataParallel vs DistributedDataParallel in PyTorch: Key Differences and Usage
DataParallel splits data across multiple GPUs on a single machine but uses one process, which can cause bottlenecks. DistributedDataParallel runs a separate process per GPU, enabling faster and more scalable training across multiple GPUs and machines.
Quick Comparison
This table summarizes the main differences between DataParallel and DistributedDataParallel in PyTorch.
| Feature | DataParallel | DistributedDataParallel |
|---|---|---|
| GPU Usage | Multiple GPUs on one machine | Multiple GPUs on one or multiple machines |
| Process Model | Single process with threads | One process per GPU |
| Performance | Slower due to GIL and single-process bottleneck | Faster and more efficient communication |
| Scalability | Limited to one machine | Scales across machines |
| Setup Complexity | Simple to use, minimal setup | Requires initialization of distributed backend |
| Recommended Use | Small scale, quick multi-GPU on one machine | Large scale, multi-GPU and multi-node training |
Key Differences
DataParallel works by splitting the input batch across GPUs on a single machine within one process. Because it relies on Python threads, the Global Interpreter Lock (GIL) can slow down the model's forward and backward passes. This can become a bottleneck, especially when using many GPUs.
In contrast, DistributedDataParallel launches one process per GPU. Each process handles its own GPU and communicates gradients asynchronously using optimized backend libraries like NCCL. This design avoids the GIL bottleneck and allows better parallelism and faster training.
Additionally, DistributedDataParallel supports multi-node training, meaning you can train across several machines, while DataParallel is limited to a single machine. However, DistributedDataParallel requires more setup, including initializing the distributed environment and managing multiple processes.
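The gradient communication described above boils down to an all-reduce: each process contributes its local gradients, and every process ends up with the average. A minimal sketch of that primitive, using the gloo backend so it runs on CPU without any GPUs (the port 29500 and the toy per-rank values are arbitrary choices for illustration):

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # gloo works on CPU, so this demo needs no GPUs
    dist.init_process_group(
        "gloo", init_method="tcp://127.0.0.1:29500",
        rank=rank, world_size=world_size,
    )
    # each process holds a different "local gradient"
    grad = torch.tensor([float(rank + 1)])
    # all_reduce sums the tensors across all processes, in place
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    grad /= world_size  # average, as DDP does for gradients
    if rank == 0:
        print(f"averaged gradient: {grad.item()}")  # prints: averaged gradient: 1.5
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2, join=True)
```

With two processes holding gradients 1.0 and 2.0, both end up with the average 1.5. DDP performs this same operation automatically during `backward()`, overlapping communication with computation.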
DataParallel Code Example
```python
import torch
import torch.nn as nn
import torch.optim as optim

# Simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 1)

    def forward(self, x):
        return self.linear(x)

# Create model and move to GPUs
model = SimpleModel()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.cuda()

# Dummy input and target
input = torch.randn(16, 10).cuda()
target = torch.randn(16, 1).cuda()

# Loss and optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Forward pass
output = model(input)
loss = criterion(output, target)

# Backward and optimize
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(f"Loss: {loss.item():.4f}")
```
DistributedDataParallel Equivalent
```python
import os
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    # Rendezvous info so all processes can find each other
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 1)

    def forward(self, x):
        return self.linear(x)

def demo_ddp(rank, world_size):
    setup(rank, world_size)
    torch.cuda.set_device(rank)

    model = SimpleModel().cuda(rank)
    ddp_model = DDP(model, device_ids=[rank])

    criterion = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.01)

    input = torch.randn(16, 10).cuda(rank)
    target = torch.randn(16, 1).cuda(rank)

    output = ddp_model(input)
    loss = criterion(output, target)

    optimizer.zero_grad()
    loss.backward()  # gradients are all-reduced across processes here
    optimizer.step()

    print(f"Rank {rank} Loss: {loss.item():.4f}")
    cleanup()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    if world_size < 2:
        print("Need at least 2 GPUs for DDP example")
    else:
        import torch.multiprocessing as mp
        mp.spawn(demo_ddp, args=(world_size,), nprocs=world_size, join=True)
```
When to Use Which
Choose DataParallel if you want a quick and simple way to use multiple GPUs on a single machine without extra setup. It is good for small projects or experiments where ease of use is more important than speed.
Choose DistributedDataParallel for serious training tasks that require speed and scalability. It is the recommended approach for multi-GPU training on one or multiple machines because it avoids Python bottlenecks and supports efficient communication.
In summary, DistributedDataParallel is the modern, high-performance choice, while DataParallel is simpler but less efficient.
Key Takeaways
- DistributedDataParallel is faster and scales better than DataParallel.
- DataParallel uses one process and threads, causing bottlenecks with many GPUs.
- DistributedDataParallel runs one process per GPU and supports multi-node training.
- Use DataParallel for simple single-machine multi-GPU setups.
- Use DistributedDataParallel for production-level, scalable training.