PyTorch Comparison · Intermediate · 4 min read

DataParallel vs DistributedDataParallel in PyTorch: Key Differences and Usage

In PyTorch, DataParallel splits data across multiple GPUs on a single machine but uses one process, which can cause bottlenecks. DistributedDataParallel runs a separate process per GPU, enabling faster and more scalable training across multiple GPUs and machines.
⚖️

Quick Comparison

This table summarizes the main differences between DataParallel and DistributedDataParallel in PyTorch.

| Feature | DataParallel | DistributedDataParallel |
| --- | --- | --- |
| GPU usage | Multiple GPUs on one machine | Multiple GPUs on one or multiple machines |
| Process model | Single process with threads | One process per GPU |
| Performance | Slower due to GIL and single-process bottleneck | Faster, more efficient communication |
| Scalability | Limited to one machine | Scales across machines |
| Setup complexity | Simple to use, minimal setup | Requires initialization of a distributed backend |
| Recommended use | Small-scale, quick multi-GPU on one machine | Large-scale, multi-GPU and multi-node training |
⚖️

Key Differences

DataParallel works by splitting the input batch across GPUs on a single machine within one process. Because it uses Python threads, the Global Interpreter Lock (GIL) can slow down the model's forward and backward passes. It also replicates the model to each GPU on every forward pass and gathers outputs back on the primary GPU, which adds overhead and can become a bottleneck, especially with many GPUs.
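The scatter step described above can be sketched in plain Python. This is a minimal illustration, not PyTorch's actual implementation: ordinary lists stand in for tensors, and `scatter_batch` is an illustrative helper name.

```python
# Minimal sketch of how DataParallel splits one batch across devices.
# Plain Python lists stand in for tensors; "devices" are just chunk indices.

def scatter_batch(batch, num_devices):
    """Split a batch into near-equal chunks, one per device."""
    chunk = (len(batch) + num_devices - 1) // num_devices  # ceil division
    return [batch[i * chunk:(i + 1) * chunk] for i in range(num_devices)]

batch = list(range(16))           # a batch of 16 samples
chunks = scatter_batch(batch, 4)  # 4 "GPUs"
print([len(c) for c in chunks])   # -> [4, 4, 4, 4]
```

Each chunk is then processed by one GPU, after which the outputs are gathered back onto the primary device.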

In contrast, DistributedDataParallel launches one process per GPU. Each process handles its own GPU and communicates gradients asynchronously using optimized backend libraries like NCCL. This design avoids the GIL bottleneck and allows better parallelism and faster training.
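Conceptually, the gradient communication amounts to an all-reduce that averages each parameter's gradient across processes. The sketch below simulates that averaging in plain Python; `all_reduce_mean` is an illustrative stand-in for what `dist.all_reduce` plus a divide by world size achieves, not the real NCCL implementation.

```python
# Pure-Python sketch of the gradient averaging DDP performs after backward().
# Each entry in local_grads stands for the gradient vector one process
# computed on its own shard of the data.

def all_reduce_mean(local_grads):
    """Average per-process gradients element-wise across all ranks."""
    world_size = len(local_grads)
    length = len(local_grads[0])
    return [sum(g[i] for g in local_grads) / world_size for i in range(length)]

local_grads = [
    [0.2, -0.4, 1.0],  # gradients from rank 0
    [0.6,  0.0, 0.2],  # gradients from rank 1
]
print(all_reduce_mean(local_grads))  # averaged gradients, identical on every rank
```

After this step every process holds the same averaged gradients, so each optimizer step keeps the model replicas in sync.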

Additionally, DistributedDataParallel supports multi-node training, meaning you can train across several machines, while DataParallel is limited to a single machine. However, DistributedDataParallel requires more setup, including initializing the distributed environment and managing multiple processes.
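In multi-node setups, each process is identified by a global rank derived from its node index and its local GPU index. A minimal sketch of that arithmetic (the values here are illustrative, not read from a real launcher):

```python
# Sketch of how a process's global rank is typically derived in
# multi-node training: node index * GPUs per node + local GPU index.

def global_rank(node_rank, gpus_per_node, local_rank):
    """Compute a process's global rank across all nodes."""
    return node_rank * gpus_per_node + local_rank

# 2 nodes x 4 GPUs: ranks 0-3 live on node 0, ranks 4-7 on node 1
print(global_rank(1, 4, 2))  # GPU 2 on node 1 -> rank 6
```

Launchers such as torchrun compute this rank for you and expose it via environment variables, which is part of the extra setup mentioned above.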

💻

DataParallel Code Example

python
import torch
import torch.nn as nn
import torch.optim as optim

# Simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 1)

    def forward(self, x):
        return self.linear(x)

# Create model and move to GPUs
model = SimpleModel()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.cuda()

# Dummy inputs and targets (avoid shadowing the built-in input())
inputs = torch.randn(16, 10).cuda()
targets = torch.randn(16, 1).cuda()

# Loss and optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Forward pass
output = model(inputs)
loss = criterion(output, targets)

# Backward and optimize
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(f"Loss: {loss.item():.4f}")
Output
Loss: 0.XXXX
↔️

DistributedDataParallel Equivalent

python
import os
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    # Tell each process where to find the rendezvous point (rank 0)
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)


def cleanup():
    dist.destroy_process_group()


class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 1)

    def forward(self, x):
        return self.linear(x)


def demo_ddp(rank, world_size):
    setup(rank, world_size)

    torch.cuda.set_device(rank)
    model = SimpleModel().cuda(rank)
    ddp_model = DDP(model, device_ids=[rank])

    criterion = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.01)

    # Dummy inputs and targets on this rank's GPU
    inputs = torch.randn(16, 10).cuda(rank)
    targets = torch.randn(16, 1).cuda(rank)

    output = ddp_model(inputs)
    loss = criterion(output, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    print(f"Rank {rank} Loss: {loss.item():.4f}")

    cleanup()


if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    if world_size < 2:
        print("Need at least 2 GPUs for DDP example")
    else:
        import torch.multiprocessing as mp
        mp.spawn(demo_ddp, args=(world_size,), nprocs=world_size, join=True)
Output
Rank 0 Loss: 0.XXXX
Rank 1 Loss: 0.XXXX
...
🎯

When to Use Which

Choose DataParallel if you want a quick and simple way to use multiple GPUs on a single machine without extra setup. It is good for small projects or experiments where ease of use is more important than speed.

Choose DistributedDataParallel for serious training tasks that require speed and scalability. It is the recommended approach for multi-GPU training on one or multiple machines because it avoids Python bottlenecks and supports efficient communication.

In summary, DistributedDataParallel is the modern, high-performance choice, while DataParallel is simpler but less efficient.

Key Takeaways

DistributedDataParallel is faster and scales better than DataParallel.
DataParallel uses one process and threads, causing bottlenecks with many GPUs.
DistributedDataParallel runs one process per GPU and supports multi-node training.
Use DataParallel for simple single-machine multi-GPU setups.
Use DistributedDataParallel for production-level, scalable training.