PyTorch · ~20 mins

Multi-GPU training in PyTorch - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output · intermediate
Output of PyTorch DataParallel model forward pass
Consider the following PyTorch code snippet using DataParallel for multi-GPU training. What will be the output shape of output after the forward pass?
PyTorch
import torch
import torch.nn as nn

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 5)

    def forward(self, x):
        return self.linear(x)

model = SimpleModel()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

model = model.cuda()
input_tensor = torch.randn(16, 10).cuda()
output = model(input_tensor)
print(output.shape)
A. torch.Size([8, 5])
B. torch.Size([32, 5])
C. RuntimeError due to input size mismatch
D. torch.Size([16, 5])
💡 Hint
DataParallel splits the batch across GPUs but concatenates outputs back to original batch size.
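If no GPUs are on hand to verify this, the scatter/gather behavior can be simulated on CPU: DataParallel chunks the input along dim 0, runs each chunk through a model replica, and concatenates the results, so the output batch dimension always matches the input. A minimal sketch, assuming 2 simulated devices:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 5)              # same shape as SimpleModel's layer
x = torch.randn(16, 10)

# Simulate DataParallel on 2 devices: scatter the batch...
chunks = torch.chunk(x, 2, dim=0)     # two chunks of shape (8, 10)
# ...run each chunk through a model replica...
partial = [model(c) for c in chunks]  # each partial output is (8, 5)
# ...then gather by concatenating along the batch dimension.
output = torch.cat(partial, dim=0)
print(output.shape)                   # torch.Size([16, 5])
```

The per-chunk shape (8, 5) is what makes option A tempting, but the gather step restores the original batch size.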
🧠 Conceptual · intermediate
Understanding DistributedDataParallel vs DataParallel
Which statement correctly describes a key difference between PyTorch's DistributedDataParallel (DDP) and DataParallel (DP) for multi-GPU training?
A. DP requires manual gradient synchronization, but DDP automatically splits input batches across GPUs.
B. DDP runs model replicas in separate processes and synchronizes gradients, while DP runs replicas in a single process, splitting input batches.
C. DDP only works on a single GPU, while DP supports multiple GPUs on one machine.
D. DP uses multiple machines for training, while DDP is limited to one machine.
💡 Hint
Think about process management and how input batches are handled.
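The process-based design can be exercised even without GPUs: DDP runs on CPU with the gloo backend, one process per replica. A minimal single-process sketch (world_size=1, assuming port 29500 is free):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Each DDP replica lives in its own process; here we start a trivial
# one-process group on CPU using the gloo backend.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")  # assumed free port
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Linear(10, 5)
ddp_model = DDP(model)           # gradients are all-reduced across processes
out = ddp_model(torch.randn(16, 10))
print(out.shape)                 # torch.Size([16, 5])

dist.destroy_process_group()
```

In real multi-GPU training this script would be launched once per GPU (e.g. via torchrun), each process feeding its replica its own shard of the data, rather than one process splitting a batch as DP does.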
Hyperparameter · advanced
Choosing batch size for multi-GPU training
You want to train a model on 4 GPUs using DistributedDataParallel. Your single-GPU batch size is 32. What is the best practice for setting the batch size when using 4 GPUs?
A. Keep the batch size at 32 and let each GPU process the full batch independently.
B. Set the per-GPU batch size to 8 so the total batch size across all 4 GPUs stays 32.
C. Set the batch size to 128 so each GPU processes 32 samples (total batch size = 4 × 32).
D. Set the batch size to 32 and manually average gradients across GPUs.
💡 Hint
Consider how DistributedDataParallel splits data across GPUs.
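The split can be inspected without any GPUs, since DistributedSampler accepts explicit num_replicas/rank arguments. A sketch assuming 4 ranks, a per-GPU batch size of 32, and a toy dataset of 1280 samples:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(1280, 10))
world_size, per_gpu_batch = 4, 32

# Each rank gets a disjoint quarter of the dataset...
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=0)
loader = DataLoader(dataset, batch_size=per_gpu_batch, sampler=sampler)

print(len(sampler))                  # 320 samples per rank (1280 / 4)
batch, = next(iter(loader))
print(batch.shape)                   # torch.Size([32, 10]) per step on each rank
print(world_size * per_gpu_batch)    # 128 -- the effective global batch size
```

Each rank's DataLoader is configured with the per-GPU batch size; the global batch size per optimizer step is that value times the number of ranks, which is why scaling the learning rate is often paired with this change.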
🔧 Debug · advanced
Identifying error in multi-GPU model saving
You trained a model using nn.DataParallel and saved it with torch.save(model.state_dict(), 'model.pth'). When loading the model later without DataParallel, you get a key mismatch error. What is the cause?
A. The saved state dict keys are prefixed with 'module.' due to DataParallel wrapping.
B. The model was saved on CPU but loaded on GPU, causing a mismatch.
C. The model architecture changed between saving and loading, causing a mismatch.
D. The saved file is corrupted and cannot be loaded properly.
💡 Hint
DataParallel adds a prefix to parameter names in the state dict.
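A common fix is to strip the prefix when loading (or, better, to save model.module.state_dict() in the first place). A sketch, runnable even without GPUs since DataParallel still registers the wrapped model under the attribute name module:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 5)
wrapped = nn.DataParallel(model)

state = wrapped.state_dict()
print(list(state))                 # ['module.weight', 'module.bias']

# Strip the 'module.' prefix so a plain (unwrapped) model can load it.
clean = {k.removeprefix("module."): v for k, v in state.items()}
plain = nn.Linear(10, 5)
plain.load_state_dict(clean)       # loads without key-mismatch errors

# Better: save the unwrapped module's state dict to begin with, e.g.
# torch.save(wrapped.module.state_dict(), 'model.pth')
```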
Metrics · expert
Interpreting training speedup with multi-GPU training
You train a model on 1 GPU and it takes 100 minutes per epoch. Using 4 GPUs with DistributedDataParallel, the epoch time reduces to 30 minutes. Which statement best explains this result?
A. The speedup is less than 4× due to communication overhead and synchronization costs.
B. The speedup is exactly 4× because 4 GPUs process data independently without overhead.
C. The speedup is more than 4× because GPUs work better together than alone.
D. The speedup is unrelated to the GPUs and caused by random fluctuations.
💡 Hint
Consider overheads in multi-GPU distributed training.
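The numbers in the question work out to a sub-linear speedup, which is the expected pattern for distributed data parallelism; a quick calculation:

```python
single_gpu_time = 100   # minutes per epoch on 1 GPU
multi_gpu_time = 30     # minutes per epoch on 4 GPUs
num_gpus = 4

speedup = single_gpu_time / multi_gpu_time   # 3.33x -- less than the ideal 4x
efficiency = speedup / num_gpus              # ~83% scaling efficiency

print(f"speedup: {speedup:.2f}x")
print(f"scaling efficiency: {efficiency:.0%}")
# The gap to 4x is the cost of gradient all-reduce, synchronization
# barriers, and any per-process data-loading overhead.
```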