Complete the code to initialize the distributed process group.
import torch.distributed as dist

dist.init_process_group(backend=[1], init_method='env://')
The init_process_group function requires a valid backend. 'gloo' is a common CPU backend that works on most platforms.
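A runnable single-process sketch of the completed answer, using the 'gloo' backend. The rendezvous environment variables are set manually here for illustration; in real use a launcher such as torchrun provides them to each worker.

```python
import os
import torch.distributed as dist

# A launcher like torchrun normally sets these; we set them
# manually so this single-process demo is self-contained.
os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '29500')
os.environ.setdefault('RANK', '0')
os.environ.setdefault('WORLD_SIZE', '1')

# 'gloo' fills blank [1]: a CPU backend available on most platforms.
dist.init_process_group(backend='gloo', init_method='env://')
world_size = dist.get_world_size()  # 1 in this single-process demo
dist.destroy_process_group()
```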
Complete the code to wrap the model with DistributedDataParallel.
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

model = nn.Linear(10, 5).to(device)
model = [1](model)
To use DistributedDataParallel, you wrap your model with DDP after moving it to the device.
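A completed sketch, run as a single CPU process with the 'gloo' backend so it works without a GPU; the environment variables are set manually for the demo.

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Manual rendezvous setup for a self-contained single-process demo.
os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '29501')
os.environ.setdefault('RANK', '0')
os.environ.setdefault('WORLD_SIZE', '1')
dist.init_process_group(backend='gloo', init_method='env://')

device = torch.device('cpu')  # CPU here so the sketch runs anywhere
model = nn.Linear(10, 5).to(device)
model = DDP(model)  # DDP fills blank [1]: wrap after moving to the device

out = model(torch.randn(4, 10))  # forward pass goes through the DDP wrapper
dist.destroy_process_group()
```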
Fix the error in the code to correctly set the device for the process.
import torch
import torch.distributed as dist
import os

local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device([1])
Each process should set its CUDA device to its local rank to avoid conflicts.
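A corrected sketch: the fix is to pass local_rank into torch.cuda.set_device. The LOCAL_RANK value is set manually here (torchrun provides it in real runs), and the CUDA call is guarded so the sketch also runs on CPU-only machines.

```python
import os
import torch

# torchrun exports LOCAL_RANK per worker; set it here for the demo.
os.environ.setdefault('LOCAL_RANK', '0')
local_rank = int(os.environ['LOCAL_RANK'])

if torch.cuda.is_available():
    # The fix: bind this process to the GPU matching its local rank.
    torch.cuda.set_device(local_rank)
    device = torch.device('cuda', local_rank)
else:
    device = torch.device('cpu')  # fallback for machines without CUDA
```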
Fill both blanks to create a DistributedSampler and DataLoader for distributed training.
from torch.utils.data import DataLoader, [1]

sampler = [2](dataset, num_replicas=dist.get_world_size(), rank=dist.get_rank())
dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)
DistributedSampler ensures each process gets a unique subset of data for training.
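A completed single-process sketch with both blanks filled by DistributedSampler. A small TensorDataset stands in for the unnamed dataset, and the process group is initialized locally so the example is self-contained.

```python
import os
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Manual rendezvous setup for a self-contained single-process demo.
os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '29502')
os.environ.setdefault('RANK', '0')
os.environ.setdefault('WORLD_SIZE', '1')
dist.init_process_group(backend='gloo', init_method='env://')

# A toy dataset standing in for the exercise's `dataset`.
dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 5, (64,)))

# Both blanks are DistributedSampler: [1] in the import, [2] in the constructor.
sampler = DistributedSampler(dataset, num_replicas=dist.get_world_size(), rank=dist.get_rank())
dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)

n_samples = len(sampler)  # this process's share: 64 / world_size
dist.destroy_process_group()
```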
Fill all three blanks to correctly perform a training step with DistributedDataParallel.
optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, targets)
loss.[1]()
optimizer.[2]()
if dist.get_rank() == 0:
    print('Loss:', loss.[3]().item())
Call loss.backward() to compute gradients, optimizer.step() to update weights, and loss.detach() to get a tensor without gradient for printing.
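A completed end-to-end sketch of the training step, run as a single CPU process with 'gloo'. The model, criterion, optimizer, and toy batch are stand-ins chosen for the demo; only the three filled blanks come from the exercise.

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Manual rendezvous setup for a self-contained single-process demo.
os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '29503')
os.environ.setdefault('RANK', '0')
os.environ.setdefault('WORLD_SIZE', '1')
dist.init_process_group(backend='gloo', init_method='env://')

# Stand-in model, loss, optimizer, and batch for the demo.
model = DDP(nn.Linear(10, 5))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs, targets = torch.randn(8, 10), torch.randint(0, 5, (8,))

optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, targets)
loss.backward()   # [1] backward: computes gradients (DDP all-reduces them)
optimizer.step()  # [2] step: applies the parameter update
if dist.get_rank() == 0:
    # [3] detach: a copy outside the autograd graph, safe for logging
    print('Loss:', loss.detach().item())

dist.destroy_process_group()
```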