Challenge - 5 Problems
Multi-GPU Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
intermediate · 2:00 remaining
Output of PyTorch DataParallel model forward pass
Consider the following PyTorch code snippet using DataParallel for multi-GPU training. What will be the output shape of output after the forward pass?
import torch
import torch.nn as nn

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 5)

    def forward(self, x):
        return self.linear(x)

model = SimpleModel()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.cuda()
input_tensor = torch.randn(16, 10).cuda()
output = model(input_tensor)
print(output.shape)
Attempts: 2 left
💡 Hint
DataParallel splits the batch across GPUs but concatenates outputs back to original batch size.
✗ Incorrect
DataParallel splits the input batch (size 16) across the available GPUs, runs the model replicas in parallel, then concatenates the per-GPU outputs along the batch dimension. The output shape is therefore torch.Size([16, 5]): the original batch size with the layer's output feature size (5).
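A minimal CPU-only sketch confirms the shape arithmetic without any GPUs: the DataParallel wrapper changes *where* the batch halves are computed, not the output shape, so a plain Linear(10, 5) layer shows the same result.

```python
import torch
import torch.nn as nn

# Same layer as in the snippet, run on CPU; DataParallel would split
# the batch across devices and concatenate, preserving this shape.
layer = nn.Linear(10, 5)
x = torch.randn(16, 10)
out = layer(x)
print(tuple(out.shape))  # (16, 5): batch size preserved, 5 output features
```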
🧠 Conceptual
intermediate · 2:00 remaining
Understanding DistributedDataParallel vs DataParallel
Which statement correctly describes a key difference between PyTorch's DistributedDataParallel (DDP) and DataParallel (DP) for multi-GPU training?
Attempts: 2 left
💡 Hint
Think about process management and how input batches are handled.
✗ Incorrect
DistributedDataParallel runs one process per GPU and synchronizes gradients across processes during the backward pass, which is more efficient and scalable. DataParallel runs in a single process, replicating the model and scattering each input batch across GPUs every iteration, which adds overhead and is generally slower.
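The one-process-per-GPU pattern can be sketched as follows. To keep the sketch runnable without GPUs, it uses a single CPU process (world_size=1) with the gloo backend; in real training you would launch one such process per GPU (e.g. via torchrun), and the rendezvous address and port below are illustrative assumptions, not required values.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# DDP's model: one process per device. Here world_size=1 on CPU with
# the gloo backend so the sketch executes anywhere.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # illustrative rendezvous address
os.environ.setdefault("MASTER_PORT", "29517")      # arbitrary free port (assumption)
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(10, 5))  # wraps the module; gradients are all-reduced
out = model(torch.randn(16, 10))     # each process forwards only its own shard
loss = out.sum()
loss.backward()  # gradient synchronization (all-reduce) overlaps with backward

dist.destroy_process_group()
```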
❓ Hyperparameter
advanced · 2:00 remaining
Choosing batch size for multi-GPU training
You want to train a model on 4 GPUs using DistributedDataParallel. Your single-GPU batch size is 32. What is the best practice for setting the batch size when using 4 GPUs?
Attempts: 2 left
💡 Hint
Consider how DistributedDataParallel splits data across GPUs.
✗ Incorrect
In DDP, each process loads its own per-GPU batches (typically via a DistributedSampler), so the effective global batch size is the per-GPU batch size times the number of GPUs. The common best practice is to keep the per-GPU batch size at 32, giving an effective batch of 128, and scale the learning rate accordingly; setting the per-GPU batch to 8 would instead reproduce the original effective batch of 32.
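The arithmetic behind this can be sketched in plain Python (no framework needed): in DDP the DataLoader batch size is per process, so the global batch is the product of the per-GPU batch and the world size.

```python
# Per-process (per-GPU) batch size vs. effective global batch in DDP.
single_gpu_batch = 32
world_size = 4

# Keep 32 per GPU: the effective batch grows to 128, often paired
# with a linearly scaled learning rate.
effective_batch = single_gpu_batch * world_size
print(effective_batch)  # 128

# To reproduce the single-GPU effective batch of 32 exactly,
# each process would instead use 32 // 4 = 8 per GPU.
per_gpu_for_same_effective = single_gpu_batch // world_size
print(per_gpu_for_same_effective)  # 8
```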
🔧 Debug
advanced · 2:00 remaining
Identifying error in multi-GPU model saving
You trained a model using nn.DataParallel and saved it with
torch.save(model.state_dict(), 'model.pth'). When loading the model later without DataParallel, you get a key mismatch error. What is the cause?
Attempts: 2 left
💡 Hint
DataParallel adds a prefix to parameter names in the state dict.
✗ Incorrect
DataParallel wraps the model and prefixes all parameter keys with 'module.'. Loading this state dict into a model without DataParallel causes key mismatch errors.
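The fix can be sketched with plain dict manipulation: strip the 'module.' prefix from every key before loading (or, better, save model.module.state_dict() in the first place). The tensor values below are stand-in strings, used only so the key handling is visible.

```python
# State dict saved from an nn.DataParallel-wrapped model: every key
# carries the 'module.' prefix added by the wrapper.
saved = {
    "module.linear.weight": "<tensor>",  # stand-in for a real tensor
    "module.linear.bias": "<tensor>",
}

# Strip the prefix so the keys match an unwrapped model's state dict.
clean = {k.removeprefix("module."): v for k, v in saved.items()}
print(sorted(clean))  # ['linear.bias', 'linear.weight']
```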
❓ Metrics
expert · 2:00 remaining
Interpreting training speedup with multi-GPU training
You train a model on 1 GPU and it takes 100 minutes per epoch. Using 4 GPUs with DistributedDataParallel, the epoch time reduces to 30 minutes. Which statement best explains this result?
Attempts: 2 left
💡 Hint
Consider overheads in multi-GPU distributed training.
✗ Incorrect
DistributedDataParallel parallelizes computation across GPUs, but gradient synchronization and other communication add overhead, so scaling is sub-linear: 4 GPUs yield a 3.33× speedup (about 83% parallel efficiency) rather than the ideal 4×.
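Plugging in the numbers from the question makes the sub-linear scaling concrete: the measured speedup and the resulting parallel efficiency both fall short of the ideal, and the gap is the communication overhead.

```python
# Epoch times from the question, in minutes.
t_1gpu, t_4gpu, n_gpus = 100, 30, 4

speedup = t_1gpu / t_4gpu       # 100 / 30 ≈ 3.33x, below the ideal 4x
efficiency = speedup / n_gpus   # ≈ 0.83, i.e. roughly 83% scaling efficiency
print(round(speedup, 2), round(efficiency, 2))
```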