DataParallel vs DistributedDataParallel in PyTorch: Key Differences and Usage
DataParallel splits data across multiple GPUs on a single machine but uses one process, which can cause bottlenecks. DistributedDataParallel runs a separate process per GPU, enabling faster and more scalable training across multiple GPUs and machines.
Quick Comparison
This table summarizes the main differences between DataParallel and DistributedDataParallel in PyTorch.
| Feature | DataParallel | DistributedDataParallel |
|---|---|---|
| GPU Usage | Multiple GPUs on one machine | Multiple GPUs on one or multiple machines |
| Process Model | Single process with threads | One process per GPU |
| Performance | Slower due to GIL and single-process bottleneck | Faster and more efficient communication |
| Scalability | Limited to one machine | Scales across machines |
| Setup Complexity | Simple to use, minimal setup | Requires initialization of distributed backend |
| Recommended Use | Small scale, quick multi-GPU on one machine | Large scale, multi-GPU and multi-node training |
Key Differences
DataParallel works by splitting the input batch across GPUs on a single machine within one process. Because it relies on Python threads, the Global Interpreter Lock (GIL) can slow down the model's forward and backward passes. This can become a bottleneck, especially when using many GPUs.
In contrast, DistributedDataParallel launches one process per GPU. Each process handles its own GPU and communicates gradients asynchronously using optimized backend libraries like NCCL. This design avoids the GIL bottleneck and allows better parallelism and faster training.
Additionally, DistributedDataParallel supports multi-node training, meaning you can train across several machines, while DataParallel is limited to a single machine. However, DistributedDataParallel requires more setup, including initializing the distributed environment and managing multiple processes.
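The gradient communication described above boils down to an all-reduce: each process contributes its local gradients, and every process ends up with the average. A minimal sketch of that primitive, using the gloo backend so it runs on CPU without any GPUs (the port 29500 and the toy per-rank values are arbitrary choices for illustration):

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # gloo works on CPU, so this demo needs no GPUs
    dist.init_process_group(
        "gloo", init_method="tcp://127.0.0.1:29500",
        rank=rank, world_size=world_size,
    )
    # each process holds a different "local gradient"
    grad = torch.tensor([float(rank + 1)])
    # all_reduce sums the tensors across all processes, in place
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    grad /= world_size  # average, as DDP does for gradients
    if rank == 0:
        print(f"averaged gradient: {grad.item()}")  # prints: averaged gradient: 1.5
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2, join=True)
```

With two processes holding gradients 1.0 and 2.0, both end up with the average 1.5. DDP performs this same operation automatically during `backward()`, overlapping communication with computation.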
DataParallel Code Example
```python
import torch
import torch.nn as nn
import torch.optim as optim

# Simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 1)

    def forward(self, x):
        return self.linear(x)

# Create model and move to GPUs
model = SimpleModel()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.cuda()

# Dummy input and target
input = torch.randn(16, 10).cuda()
target = torch.randn(16, 1).cuda()

# Loss and optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Forward pass
output = model(input)
loss = criterion(output, target)

# Backward and optimize
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(f"Loss: {loss.item():.4f}")
```
DistributedDataParallel Equivalent
```python
import os
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    # Rendezvous info so all processes can find each other
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 1)

    def forward(self, x):
        return self.linear(x)

def demo_ddp(rank, world_size):
    setup(rank, world_size)
    torch.cuda.set_device(rank)

    model = SimpleModel().cuda(rank)
    ddp_model = DDP(model, device_ids=[rank])

    criterion = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.01)

    input = torch.randn(16, 10).cuda(rank)
    target = torch.randn(16, 1).cuda(rank)

    output = ddp_model(input)
    loss = criterion(output, target)

    optimizer.zero_grad()
    loss.backward()  # gradients are all-reduced across processes here
    optimizer.step()

    print(f"Rank {rank} Loss: {loss.item():.4f}")
    cleanup()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    if world_size < 2:
        print("Need at least 2 GPUs for DDP example")
    else:
        import torch.multiprocessing as mp
        mp.spawn(demo_ddp, args=(world_size,), nprocs=world_size, join=True)
```
When to Use Which
Choose DataParallel if you want a quick and simple way to use multiple GPUs on a single machine without extra setup. It is good for small projects or experiments where ease of use is more important than speed.
Choose DistributedDataParallel for serious training tasks that require speed and scalability. It is the recommended approach for multi-GPU training on one or multiple machines because it avoids Python bottlenecks and supports efficient communication.
In summary, DistributedDataParallel is the modern, high-performance choice, while DataParallel is simpler but less efficient.
Key Takeaways
- DistributedDataParallel is faster and scales better than DataParallel.
- DataParallel uses one process and threads, causing bottlenecks with many GPUs.
- DistributedDataParallel runs one process per GPU and supports multi-node training.
- Use DataParallel for simple single-machine multi-GPU setups.
- Use DistributedDataParallel for production-level, scalable training.