Imagine you want to train a deep learning model faster by using multiple GPUs. Why is DistributedDataParallel (DDP) preferred over DataParallel in PyTorch?
Think about how Python's Global Interpreter Lock (GIL) affects multi-threading and how DDP uses multiple processes.
DDP launches one process per GPU, which sidesteps Python's GIL and overlaps gradient all-reduce with the backward pass. DataParallel uses threads within a single process, replicates the model on every forward pass, and scatters inputs and gathers outputs through one GPU, causing slower training and more overhead.
What will be the printed output of the following PyTorch code snippet using DistributedDataParallel?
```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    # Rendezvous settings must be set before init_process_group
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 2)

    def forward(self, x):
        return self.linear(x)

rank = 0
world_size = 1
setup(rank, world_size)
model = SimpleModel().to(rank)  # rank doubles as the CUDA device index
ddp_model = DDP(model, device_ids=None)
input_tensor = torch.tensor([[1.0, 2.0]], device=rank)
output = ddp_model(input_tensor)
print(output)
```
The model's linear layer is initialized with random weights, so the printed output is a freshly computed tensor, not the input echoed back.
The linear layer applies a matrix multiplication plus a bias to the input tensor (y = xWᵀ + b), producing a new 1×2 tensor. Since the weights are randomly initialized, the exact values vary from run to run; only the output's shape and dtype are predictable.
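To make the arithmetic concrete, here is a minimal pure-Python sketch of what nn.Linear(2, 2) computes for the input [[1.0, 2.0]]. The weight and bias values below are made up for illustration; PyTorch draws them randomly.

```python
def linear_forward(x, W, b):
    # y_i = sum_j x_j * W[i][j] + b[i]  (one output per weight row)
    return [sum(xj * wij for xj, wij in zip(x, w_row)) + bi
            for w_row, bi in zip(W, b)]

x = [1.0, 2.0]                       # the input from the snippet
W = [[0.5, -0.25], [0.25, 0.5]]      # illustrative weights (normally random)
b = [0.5, -0.25]                     # illustrative bias
print(linear_forward(x, W, b))       # -> [0.5, 1.0]
```

With different random weights the numbers change, but the result is always a 2-element row, matching the 1×2 output tensor in the question.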
You have 4 GPUs and want to train a model using DistributedDataParallel. Your original batch size for single GPU training is 64. What should be the batch size per GPU when using DDP to keep the effective batch size the same?
Think about how DDP splits data across GPUs and how total batch size is calculated.
In DDP, each GPU processes its own batch. To keep the total batch size same as single GPU training, divide the original batch size by number of GPUs. So 64 / 4 = 16 per GPU.
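The division above can be sketched as a small helper (the name per_gpu_batch_size is mine, not a PyTorch API):

```python
def per_gpu_batch_size(global_batch, world_size):
    # Each of the world_size processes handles one equally sized shard,
    # so the effective batch per optimizer step stays at global_batch.
    assert global_batch % world_size == 0, "batch must split evenly across GPUs"
    return global_batch // world_size

print(per_gpu_batch_size(64, 4))  # -> 16
print(per_gpu_batch_size(64, 8))  # -> 8
```

Note that if you instead keep 64 per GPU, the effective batch becomes 256, which usually calls for rescaling the learning rate.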
Consider this PyTorch DDP training snippet. It raises a RuntimeError: "Expected to have same number of elements in all input tensors" during backward. What is the likely cause?
```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

rank = 0  # in a real run, each of the world_size processes has its own rank
world_size = 2
setup(rank, world_size)
model = nn.Linear(2, 2).to(rank)
ddp_model = DDP(model, device_ids=None)
# Rank 0 feeds a batch of 3; every other rank feeds a batch of 4
input_tensor = torch.randn(3, 2).to(rank) if rank == 0 else torch.randn(4, 2).to(rank)
output = ddp_model(input_tensor)
loss = output.sum()
loss.backward()
```
Check the input tensor sizes on each rank and how DDP expects inputs.
DDP synchronizes gradients with collective operations that expect every process to contribute matching tensors. When rank 0 feeds a batch of 3 while the other ranks feed a batch of 4, the ranks fall out of step and the backward-pass synchronization fails with a size-mismatch error. Give every rank an equally sized batch, for example via DistributedSampler (with drop_last=True when the dataset does not divide evenly).
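DistributedSampler avoids this by handing every rank an equally sized shard of the dataset. Conceptually (a simplified pure-Python sketch of the idea, not the actual PyTorch implementation, and without shuffling) it pads the index list so it divides evenly and then strides through it:

```python
import math

def shard_indices(dataset_len, num_replicas, rank):
    # Pad the index list by reusing early indices so every rank
    # receives exactly ceil(dataset_len / num_replicas) samples,
    # then take every num_replicas-th index starting at this rank.
    per_rank = math.ceil(dataset_len / num_replicas)
    indices = list(range(dataset_len))
    indices += indices[: per_rank * num_replicas - dataset_len]  # pad by reuse
    return indices[rank::num_replicas]

# 7 samples across 2 ranks: both ranks get 4 indices (one sample is repeated).
print(shard_indices(7, 2, 0))  # -> [0, 2, 4, 6]
print(shard_indices(7, 2, 1))  # -> [1, 3, 5, 0]
```

Because both shards have the same length, every rank runs the same number of equally sized batches and the collectives stay in lockstep.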
You want to train a model where some layers run on CPU and others on GPU, but still use DistributedDataParallel for multi-GPU training. Which approach is correct?
Consider how DDP expects model parameters and devices.
DDP requires all parameters it manages to be on the same device per process. Wrapping only GPU layers with DDP and handling CPU layers separately allows mixed device usage while still benefiting from DDP on GPUs.
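One way to structure this is sketched below. MixedDeviceModel and its submodule names are hypothetical, and the DDP wrapping of the GPU head is shown commented out because it requires an initialized process group; the sketch runs on CPU when no GPU is present.

```python
import torch
import torch.nn as nn

class MixedDeviceModel(nn.Module):
    """Hypothetical model: the embedding stays on CPU, the head runs on GPU."""
    def __init__(self, gpu_device):
        super().__init__()
        self.gpu_device = gpu_device
        self.cpu_embed = nn.Embedding(100, 8)            # kept on CPU
        self.gpu_head = nn.Linear(8, 2).to(gpu_device)   # lives on the GPU

    def forward(self, token_ids):
        h = self.cpu_embed(token_ids)    # computed on CPU
        h = h.to(self.gpu_device)        # activations hop to the GPU
        return self.gpu_head(h)

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
model = MixedDeviceModel(device)
# Wrap ONLY the GPU submodule with DDP; the CPU embedding stays outside,
# so its gradients must be synchronized manually (e.g. dist.all_reduce).
# if dist.is_initialized():
#     model.gpu_head = DDP(model.gpu_head, device_ids=[0])
out = model(torch.tensor([[1, 2, 3]]))
print(out.shape)  # torch.Size([1, 3, 2])
```

The design choice here is that only the parameters DDP manages must share a device; anything left outside the wrapper is the user's responsibility to synchronize.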