Imagine you want to train a deep learning model faster by using multiple GPUs. Why is DistributedDataParallel (DDP) preferred over DataParallel in PyTorch?
Think about how Python's Global Interpreter Lock (GIL) affects multi-threading and how DDP uses multiple processes.
DDP launches one process per GPU, which sidesteps Python's GIL and overlaps gradient all-reduce with the backward pass. DataParallel uses threads within a single process, replicates the model on every forward pass, and scatters inputs and gathers outputs through one GPU, causing slower training and more overhead.
What will be the printed output of the following PyTorch code snippet using DistributedDataParallel?
```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    # Rendezvous settings must be set before init_process_group
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 2)

    def forward(self, x):
        return self.linear(x)

rank = 0
world_size = 1
setup(rank, world_size)
model = SimpleModel().to(rank)  # rank doubles as the CUDA device index
ddp_model = DDP(model, device_ids=None)
input_tensor = torch.tensor([[1.0, 2.0]], device=rank)
output = ddp_model(input_tensor)
print(output)
```
The model's linear layer is initialized with random weights, so the printed output is a freshly computed tensor, not the input echoed back.
The linear layer applies a matrix multiplication plus a bias to the input tensor (y = xWᵀ + b), producing a new 1×2 tensor. Since the weights are randomly initialized, the exact values vary from run to run; only the output's shape and dtype are predictable.
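To make the arithmetic concrete, here is a minimal pure-Python sketch of what nn.Linear(2, 2) computes for the input [[1.0, 2.0]]. The weight and bias values below are made up for illustration; PyTorch draws them randomly.

```python
def linear_forward(x, W, b):
    # y_i = sum_j x_j * W[i][j] + b[i]  (one output per weight row)
    return [sum(xj * wij for xj, wij in zip(x, w_row)) + bi
            for w_row, bi in zip(W, b)]

x = [1.0, 2.0]                       # the input from the snippet
W = [[0.5, -0.25], [0.25, 0.5]]      # illustrative weights (normally random)
b = [0.5, -0.25]                     # illustrative bias
print(linear_forward(x, W, b))       # -> [0.5, 1.0]
```

With different random weights the numbers change, but the result is always a 2-element row, matching the 1×2 output tensor in the question.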
You have 4 GPUs and want to train a model using DistributedDataParallel. Your original batch size for single GPU training is 64. What should be the batch size per GPU when using DDP to keep the effective batch size the same?
Think about how DDP splits data across GPUs and how total batch size is calculated.
In DDP, each GPU processes its own batch. To keep the total batch size same as single GPU training, divide the original batch size by number of GPUs. So 64 / 4 = 16 per GPU.
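The division above can be sketched as a small helper (the name per_gpu_batch_size is mine, not a PyTorch API):

```python
def per_gpu_batch_size(global_batch, world_size):
    # Each of the world_size processes handles one equally sized shard,
    # so the effective batch per optimizer step stays at global_batch.
    assert global_batch % world_size == 0, "batch must split evenly across GPUs"
    return global_batch // world_size

print(per_gpu_batch_size(64, 4))  # -> 16
print(per_gpu_batch_size(64, 8))  # -> 8
```

Note that if you instead keep 64 per GPU, the effective batch becomes 256, which usually calls for rescaling the learning rate.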
Consider this PyTorch DDP training snippet. It raises a RuntimeError: "Expected to have same number of elements in all input tensors" during backward. What is the likely cause?
```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

rank = 0  # in a real run, each of the world_size processes has its own rank
world_size = 2
setup(rank, world_size)
model = nn.Linear(2, 2).to(rank)
ddp_model = DDP(model, device_ids=None)
# Rank 0 feeds a batch of 3; every other rank feeds a batch of 4
input_tensor = torch.randn(3, 2).to(rank) if rank == 0 else torch.randn(4, 2).to(rank)
output = ddp_model(input_tensor)
loss = output.sum()
loss.backward()
```
Check the input tensor sizes on each rank and how DDP expects inputs.
DDP synchronizes gradients with collective operations that expect every process to contribute matching tensors. When rank 0 feeds a batch of 3 while the other ranks feed a batch of 4, the ranks fall out of step and the backward-pass synchronization fails with a size-mismatch error. Give every rank an equally sized batch, for example via DistributedSampler (with drop_last=True when the dataset does not divide evenly).
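DistributedSampler avoids this by handing every rank an equally sized shard of the dataset. Conceptually (a simplified pure-Python sketch of the idea, not the actual PyTorch implementation, and without shuffling) it pads the index list so it divides evenly and then strides through it:

```python
import math

def shard_indices(dataset_len, num_replicas, rank):
    # Pad the index list by reusing early indices so every rank
    # receives exactly ceil(dataset_len / num_replicas) samples,
    # then take every num_replicas-th index starting at this rank.
    per_rank = math.ceil(dataset_len / num_replicas)
    indices = list(range(dataset_len))
    indices += indices[: per_rank * num_replicas - dataset_len]  # pad by reuse
    return indices[rank::num_replicas]

# 7 samples across 2 ranks: both ranks get 4 indices (one sample is repeated).
print(shard_indices(7, 2, 0))  # -> [0, 2, 4, 6]
print(shard_indices(7, 2, 1))  # -> [1, 3, 5, 0]
```

Because both shards have the same length, every rank runs the same number of equally sized batches and the collectives stay in lockstep.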
You want to train a model where some layers run on CPU and others on GPU, but still use DistributedDataParallel for multi-GPU training. Which approach is correct?
Consider how DDP expects model parameters and devices.
DDP requires all parameters it manages to be on the same device per process. Wrapping only GPU layers with DDP and handling CPU layers separately allows mixed device usage while still benefiting from DDP on GPUs.
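One way to structure this is sketched below. MixedDeviceModel and its submodule names are hypothetical, and the DDP wrapping of the GPU head is shown commented out because it requires an initialized process group; the sketch runs on CPU when no GPU is present.

```python
import torch
import torch.nn as nn

class MixedDeviceModel(nn.Module):
    """Hypothetical model: the embedding stays on CPU, the head runs on GPU."""
    def __init__(self, gpu_device):
        super().__init__()
        self.gpu_device = gpu_device
        self.cpu_embed = nn.Embedding(100, 8)            # kept on CPU
        self.gpu_head = nn.Linear(8, 2).to(gpu_device)   # lives on the GPU

    def forward(self, token_ids):
        h = self.cpu_embed(token_ids)    # computed on CPU
        h = h.to(self.gpu_device)        # activations hop to the GPU
        return self.gpu_head(h)

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
model = MixedDeviceModel(device)
# Wrap ONLY the GPU submodule with DDP; the CPU embedding stays outside,
# so its gradients must be synchronized manually (e.g. dist.all_reduce).
# if dist.is_initialized():
#     model.gpu_head = DDP(model.gpu_head, device_ids=[0])
out = model(torch.tensor([[1, 2, 3]]))
print(out.shape)  # torch.Size([1, 3, 2])
```

The design choice here is that only the parameters DDP manages must share a device; anything left outside the wrapper is the user's responsibility to synchronize.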