PyTorch · ~20 mins

DistributedDataParallel in PyTorch - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual · intermediate
Why use DistributedDataParallel (DDP)?

Imagine you want to train a deep learning model faster by using multiple GPUs. Why is DistributedDataParallel (DDP) preferred over DataParallel in PyTorch?

A. DDP reduces GPU memory usage by splitting the model across GPUs, while DataParallel duplicates the model on each GPU.
B. DDP synchronizes gradients efficiently across GPUs and avoids Python GIL bottlenecks, while DataParallel runs in a single process, causing slower training.
C. DDP automatically increases the batch size without code changes, while DataParallel requires manual batch size tuning.
D. DDP only works on CPU clusters, while DataParallel is designed for GPUs.
💡 Hint

Think about how Python's Global Interpreter Lock (GIL) affects multi-threading and how DDP uses multiple processes.
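The hint can be made concrete: DDP runs one OS process per rank, so the GIL never serializes training work the way DataParallel's threads do. Below is a minimal runnable sketch of the DDP lifecycle, collapsed to a single rank on CPU with the gloo backend so it runs anywhere; the address and port values are arbitrary assumptions, and real training would launch one such process per GPU (e.g. via torchrun).

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# The default env:// rendezvous reads these; values here are illustrative.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# One process per rank: world_size=1 keeps this demo single-process.
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(nn.Linear(2, 2))   # CPU module: no device_ids argument needed
out = model(torch.randn(4, 2))
out.sum().backward()           # gradient all-reduce across ranks happens here

dist.destroy_process_group()
```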

Predict Output · intermediate
Output of DDP model training snippet

What will be the printed output of the following PyTorch code snippet using DistributedDataParallel?

PyTorch
import os

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    # The default env:// rendezvous needs these set before init_process_group.
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 2)
    def forward(self, x):
        return self.linear(x)

rank = 0
world_size = 1
setup(rank, world_size)
model = SimpleModel().to(rank)
ddp_model = DDP(model, device_ids=None)
input_tensor = torch.tensor([[1.0, 2.0]], device=rank)
output = ddp_model(input_tensor)
print(output)
A. tensor([[some_value, some_value]], grad_fn=<AddmmBackward0>)
B. tensor([[1.0, 2.0]], grad_fn=<AddmmBackward0>)
C. RuntimeError: Expected all tensors to be on the same device
D. tensor([[0.0, 0.0]], grad_fn=<AddmmBackward0>)
💡 Hint

The linear layer's weights are randomly initialized, so the output is a tensor of computed values, not the input itself.
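To see where values like `some_value` come from, here is the same forward pass stripped of DDP. Because the weights are randomly initialized, only the output's shape and `grad_fn` are predictable; the manual seed below is an illustrative addition, purely to make repeated runs reproducible.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)            # only so repeated runs print the same numbers
layer = nn.Linear(2, 2)         # weights and bias are randomly initialized
x = torch.tensor([[1.0, 2.0]])
y = layer(x)

print(y)                        # values depend on the random init
print(type(y.grad_fn).__name__) # AddmmBackward0 — Linear uses addmm for 2D input
```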

Hyperparameter · advanced
Choosing the correct batch size with DistributedDataParallel

You have 4 GPUs and want to train a model using DistributedDataParallel. Your original batch size for single GPU training is 64. What should be the batch size per GPU when using DDP to keep the effective batch size the same?

A. 64 per GPU, total batch size 256
B. 256 per GPU, total batch size 1024
C. 16 per GPU, total batch size 64
D. 1 per GPU, total batch size 4
💡 Hint

Think about how DDP splits data across GPUs and how total batch size is calculated.
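The hint reduces to simple arithmetic. A sketch, assuming the quiz's numbers (single-GPU batch of 64, 4 GPUs): each DDP rank sees only its shard of the data, so the per-GPU batch must shrink to keep the effective (global) batch size constant.

```python
single_gpu_batch = 64   # batch size from the original single-GPU run
world_size = 4          # number of DDP processes, one per GPU

# Each rank processes its own batch, so the global batch is the sum
# across ranks. To keep it at 64, divide by the number of ranks:
per_gpu_batch = single_gpu_batch // world_size
effective_batch = per_gpu_batch * world_size

print(per_gpu_batch)    # → 16
print(effective_batch)  # → 64
```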

🔧 Debug · advanced
Why does this DDP code raise RuntimeError?

Consider this PyTorch DDP training snippet. It raises a RuntimeError: "Expected to have same number of elements in all input tensors" during backward. What is the likely cause?

PyTorch
import os

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    # The default env:// rendezvous needs these set before init_process_group.
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

# In a real run, each of the two ranks executes this script in its own process.
rank = 0
world_size = 2
setup(rank, world_size)
model = nn.Linear(2, 2).to(rank)
ddp_model = DDP(model, device_ids=None)

input_tensor = torch.randn(3, 2).to(rank) if rank == 0 else torch.randn(4, 2).to(rank)
output = ddp_model(input_tensor)
loss = output.sum()
loss.backward()
A. The loss function is not defined properly, causing backward to fail.
B. The model weights are not moved to the correct device before wrapping with DDP.
C. The distributed process group is not initialized before model creation.
D. The input tensors have different batch sizes on each rank, causing a mismatch during gradient synchronization.
💡 Hint

Check the input tensor sizes on each rank and how DDP expects inputs.
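One way to avoid mismatched per-rank inputs in the first place is `DistributedSampler`, which shards a dataset so every rank draws the same number of samples. A small sketch, constructing the samplers with explicit `num_replicas`/`rank` arguments so no process group is needed for the demo (`shuffle=False` only to make the shards deterministic):

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset

data = TensorDataset(torch.randn(10, 2))   # 10 samples, illustrative shapes

# Explicit num_replicas/rank lets us build both ranks' samplers in one process.
s0 = DistributedSampler(data, num_replicas=2, rank=0, shuffle=False)
s1 = DistributedSampler(data, num_replicas=2, rank=1, shuffle=False)

print(len(s0), len(s1))        # → 5 5 — equal shard sizes on every rank
print(sorted(list(s0) + list(s1)))  # together the shards cover all 10 indices
```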

Model Choice · expert
Best model wrapping strategy for mixed CPU-GPU training with DDP

You want to train a model where some layers run on CPU and others on GPU, but still use DistributedDataParallel for multi-GPU training. Which approach is correct?

A. Wrap only the GPU layers with DDP and keep the CPU layers outside the DDP wrapper, combining outputs manually.
B. Split the model into two parts: wrap the GPU layers with DDP on the GPUs, and run the CPU layers separately without DDP.
C. Wrap the entire model with DDP and move only the GPU layers to GPU devices; the CPU layers stay on CPU.
D. Use DataParallel instead of DDP, because DDP requires the whole model to be on GPU.
💡 Hint

Consider which devices DDP expects the wrapped module's parameters to live on.
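A hedged sketch of the split pattern some of the options describe: keep the CPU-only layers outside DDP and wrap just the accelerator-resident submodule. It is shown entirely on CPU (gloo backend, world_size=1) so it runs anywhere; in a real multi-GPU setup `gpu_part` would be moved to `cuda:rank` and DDP would receive `device_ids=[rank]`. The layer shapes and port number are illustrative assumptions, not part of the quiz.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

cpu_part = nn.Embedding(10, 4)   # e.g. a large embedding kept on CPU, unwrapped
gpu_part = DDP(nn.Linear(4, 2))  # only this submodule's gradients are synced

tokens = torch.tensor([[1, 2, 3]])
features = cpu_part(tokens)      # CPU compute, outside the DDP wrapper
out = gpu_part(features)         # in practice features would cross to GPU here
out.sum().backward()             # autograd still flows through both parts

dist.destroy_process_group()
print(out.shape)
```

Gradients reach `cpu_part` through ordinary autograd; only `gpu_part`'s gradients are all-reduced by DDP, so any CPU parameters would need their own synchronization in a true multi-rank run.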