Complete the code to initialize the distributed training environment using PyTorch.
import torch.distributed as dist

dist.init_process_group(backend=[1], init_method='env://')
The nccl backend is optimized for NVIDIA GPUs and is commonly used for distributed training on GPU clusters.
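As a minimal illustration, the sketch below initializes a single-process group, falling back to the gloo backend on CPU-only machines; the rendezvous address, port, rank, and world size are placeholder assumptions normally supplied by a launcher such as torchrun.

```python
import os
import torch
import torch.distributed as dist

# Placeholder rendezvous settings; a launcher like torchrun sets these for you.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

# nccl requires NVIDIA GPUs; gloo is the portable CPU fallback.
backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend, init_method="env://")

world_size = dist.get_world_size()
dist.destroy_process_group()
```

In a real multi-GPU job each process runs this same code with its own RANK, and all processes join the same group.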
Complete the code to wrap the model for distributed training in PyTorch.
import torch.nn as nn
import torch.distributed as dist

model = nn.Linear(10, 2)
model = [1](model)
DistributedDataParallel wraps the model to synchronize gradients across multiple processes during distributed training.
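A hedged single-process sketch of the wrapping step, using the gloo backend on CPU so it runs without GPUs (the rendezvous values are placeholder assumptions):

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Placeholder rendezvous settings for a single-process CPU demo.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group(backend="gloo", init_method="env://", rank=0, world_size=1)

model = nn.Linear(10, 2)
ddp_model = DDP(model)  # hooks gradient all-reduce into backward()

out = ddp_model(torch.randn(4, 10))
out.sum().backward()  # with more processes, gradients are averaged here
out_shape = tuple(out.shape)
dist.destroy_process_group()
```

On GPUs you would additionally pass `device_ids=[local_rank]` so each replica is bound to its own device.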
Complete the code to set the correct CUDA device for each process in distributed training.
import torch
import os

local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device([1])
Setting the CUDA device to local_rank ensures each process uses the correct GPU.
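A sketch of the device-selection step that also runs on GPU-less machines; the assumption here is a torchrun-style launch, which sets LOCAL_RANK per process (outside such a launch we default it to 0):

```python
import os
import torch

# torchrun exports LOCAL_RANK for each spawned process; default to 0 otherwise.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))

if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)          # pin this process to GPU local_rank
    device = torch.device("cuda", local_rank)
else:
    device = torch.device("cpu")               # CPU fallback for GPU-less machines
```

Pinning the device before creating the model prevents all processes from silently allocating on GPU 0.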
Fill both blanks to create a distributed sampler for the training dataset.
from torch.utils.data import DataLoader, [1]

train_sampler = [2](dataset, num_replicas=world_size, rank=rank)
DistributedSampler ensures each process gets a unique subset of the dataset during distributed training.
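To make the partitioning concrete, this sketch builds samplers for two hypothetical ranks over a toy 8-sample dataset; `num_replicas` and `rank` are passed explicitly so it runs outside a launched job (inside one, they default to the process group's values):

```python
import torch
from torch.utils.data import TensorDataset, DistributedSampler

# Toy dataset of 8 samples, split across a hypothetical world of 2 processes.
dataset = TensorDataset(torch.arange(8).float())
sampler_rank0 = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=False)
sampler_rank1 = DistributedSampler(dataset, num_replicas=2, rank=1, shuffle=False)

# With shuffle=False the assignment is deterministic and the ranks are disjoint.
idx0, idx1 = list(sampler_rank0), list(sampler_rank1)
```

Each rank iterates over its own half of the indices, so no sample is processed twice per epoch.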
Fill all three blanks to correctly configure the DataLoader for distributed training.
train_loader = DataLoader(dataset, batch_size=[1], sampler=[2], shuffle=[3])
When a sampler is provided, shuffle must be False (its default): the sampler controls shuffling, and DataLoader raises a ValueError if both shuffle=True and a sampler are passed.
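A minimal sketch of the full DataLoader configuration, again with explicit `num_replicas` and `rank` so it runs without an initialized process group:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

dataset = TensorDataset(torch.arange(16).float())
# This rank sees 16 / 2 = 8 samples of the hypothetical 2-process world.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)

# shuffle stays False: the sampler already shuffles, and DataLoader rejects
# the combination of shuffle=True with an explicit sampler.
loader = DataLoader(dataset, batch_size=4, sampler=sampler, shuffle=False)

num_batches = sum(1 for _ in loader)  # 8 samples / batch_size 4
```

In a training loop, call `sampler.set_epoch(epoch)` at the start of each epoch so the shuffle order differs between epochs.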