
DistributedDataParallel in PyTorch - ML Experiment: Train & Evaluate

Experiment - DistributedDataParallel
Problem: You have a deep learning model training on a single GPU. Training is slow, and you want to speed it up by using multiple GPUs with PyTorch's DistributedDataParallel (DDP). Currently, the model trains on one GPU and reaches 75% training accuracy after 10 epochs.
Current Metrics: Training accuracy: 75%, Validation accuracy: 72%, Training loss: 0.6, Validation loss: 0.65
Issue: Training is slow because it uses only one GPU. You want to distribute training across multiple GPUs to reduce wall-clock time without losing accuracy.
Your Task
Modify the training code to use PyTorch's DistributedDataParallel to train on 2 GPUs, aiming to reduce training time by at least 40% while maintaining validation accuracy above 70%.
Use only 2 GPUs available on the machine.
Keep the model architecture and dataset unchanged.
Do not change batch size or learning rate.
Solution
PyTorch
import os
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, DistributedSampler
from torchvision import datasets, transforms
from torchvision.models import resnet18

def setup(rank, world_size):
    # Rendezvous address shared by all spawned processes
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    # Pin this process to its own GPU so NCCL collectives use the right device
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def train(rank, world_size):
    setup(rank, world_size)
    torch.manual_seed(42)
    device = torch.device(f'cuda:{rank}')

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

    # Download once on rank 0 so the two processes don't race on the same files
    if rank == 0:
        datasets.CIFAR10('./data', train=True, download=True)
        datasets.CIFAR10('./data', train=False, download=True)
    dist.barrier()
    train_dataset = datasets.CIFAR10('./data', train=True, transform=transform)
    val_dataset = datasets.CIFAR10('./data', train=False, transform=transform)

    train_sampler = DistributedSampler(train_dataset, num_replicas=world_size, rank=rank, shuffle=True)
    train_loader = DataLoader(train_dataset, batch_size=32, sampler=train_sampler)
    val_loader = DataLoader(val_dataset, batch_size=32)

    model = resnet18(num_classes=10).to(device)
    model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.01)

    epochs = 10
    for epoch in range(epochs):
        model.train()
        train_sampler.set_epoch(epoch)
        total_loss = 0.0
        correct = 0
        total = 0
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item() * inputs.size(0)
            _, predicted = torch.max(outputs, 1)
            correct += (predicted == labels).sum().item()
            total += labels.size(0)
        # Aggregate training metrics across GPUs (all_reduce defaults to SUM)
        loss_tensor = torch.tensor(total_loss, device=device)
        correct_tensor = torch.tensor(correct, device=device)
        total_tensor = torch.tensor(total, device=device)
        dist.all_reduce(loss_tensor)
        dist.all_reduce(correct_tensor)
        dist.all_reduce(total_tensor)
        train_loss = loss_tensor.item() / total_tensor.item()
        train_acc = correct_tensor.item() / total_tensor.item() * 100

        if rank == 0:
            model.eval()
            val_loss = 0.0
            val_correct = 0
            val_total = 0
            with torch.no_grad():
                for inputs, labels in val_loader:
                    inputs, labels = inputs.to(device), labels.to(device)
                    outputs = model.module(inputs)  # unwrap DDP: only rank 0 runs validation
                    loss = criterion(outputs, labels)
                    val_loss += loss.item() * inputs.size(0)
                    _, predicted = torch.max(outputs, 1)
                    val_correct += (predicted == labels).sum().item()
                    val_total += labels.size(0)
            val_loss /= val_total
            val_acc = val_correct / val_total * 100
            print(f'Epoch {epoch+1}/{epochs} - Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}%, Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.2f}%')

    cleanup()

if __name__ == '__main__':
    world_size = 2
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)
Initialized distributed process group with 'nccl' backend for GPU communication.
Used torch.multiprocessing.spawn to launch one process per GPU.
Wrapped the model with DistributedDataParallel and moved it to the correct GPU.
Used DistributedSampler to split the training dataset across GPUs.
Ensured each process sets its own device and synchronizes training epochs.
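The partitioning that DistributedSampler performs can be illustrated with a simplified, pure-Python sketch (this mirrors its non-shuffled behavior; `shard_indices` is an illustrative helper, not a PyTorch API, and the real sampler also reshuffles per epoch via set_epoch):

```python
def shard_indices(dataset_len, num_replicas, rank):
    """Simplified sketch of DistributedSampler's partitioning (shuffle off).

    Each rank takes every num_replicas-th index, starting at its own rank,
    so the replicas cover the dataset with no overlap.
    """
    # Pad so the dataset divides evenly, as DistributedSampler does by
    # reusing leading indices.
    total = ((dataset_len + num_replicas - 1) // num_replicas) * num_replicas
    indices = [i % dataset_len for i in range(total)]
    return indices[rank:total:num_replicas]

# Two replicas over a 10-sample dataset: disjoint, interleaved shards.
print(shard_indices(10, 2, 0))  # [0, 2, 4, 6, 8]
print(shard_indices(10, 2, 1))  # [1, 3, 5, 7, 9]
```

Because each rank sees only its shard, the per-GPU batch of 32 covers the full dataset in half as many steps per epoch with 2 GPUs.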
Results Interpretation

Before: Training accuracy: 75%, Validation accuracy: 72%, Training loss: 0.6, Validation loss: 0.65, Training time: 100%

After: Training accuracy: 74%, Validation accuracy: 71%, Training loss: 0.62, Validation loss: 0.66, Training time: 55%

Using DistributedDataParallel trains on multiple GPUs simultaneously: each process computes gradients on its own shard of the data, and DDP averages them across GPUs during the backward pass. Keeping the per-GPU batch at 32 makes the effective global batch 64, which accounts for the slight dip in accuracy, while training time drops to roughly 55% of the single-GPU baseline and validation accuracy stays above the 70% target.
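The reported timings can be turned into speedup and per-GPU scaling efficiency with a quick sanity check (`scaling_stats` is an illustrative helper using the Before/After numbers above):

```python
def scaling_stats(baseline_time, parallel_time, num_gpus):
    """Compute speedup and per-GPU scaling efficiency from wall-clock times."""
    speedup = baseline_time / parallel_time
    efficiency = speedup / num_gpus
    return speedup, efficiency

# Times expressed as percentages of the single-GPU baseline (100% -> 55%).
speedup, efficiency = scaling_stats(100, 55, 2)
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
# ~1.82x speedup at ~91% efficiency: short of a perfect 2x because of
# gradient synchronization and the rank-0-only validation pass.
```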
Bonus Experiment
Try training with 4 GPUs using DistributedDataParallel and observe how training time and accuracy change.
💡 Hint
Adjust world_size to 4 and ensure your machine has 4 GPUs available. Use the same code structure but increase the number of processes.
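A defensive way to pick world_size is to query the actual GPU count rather than hard-coding 4. A minimal sketch, assuming PyTorch may or may not be installed (`pick_world_size` is an illustrative helper; `torch.cuda.device_count()` is the real query):

```python
def pick_world_size(requested=4):
    """Use up to `requested` GPUs, falling back gracefully when fewer exist."""
    try:
        import torch
        available = torch.cuda.device_count()
    except ImportError:  # PyTorch not installed; treat as no GPUs
        available = 0
    if available == 0:
        return 1  # single-process fallback (e.g. CPU with the 'gloo' backend)
    return min(requested, available)

world_size = pick_world_size(4)
# The per-GPU batch stays 32, so the effective global batch grows with
# world_size: 32 * world_size samples per optimizer step.
effective_batch = 32 * world_size
```

Watch whether the larger effective batch at 4 GPUs costs more accuracy than it did at 2; that trade-off is the point of the bonus experiment.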