Experiment - DataParallel basics

Problem: Your neural network trains on a single GPU, and training is slow. You want to use multiple GPUs to speed it up with PyTorch's DataParallel.

Current Metrics: Training time per epoch: 120 seconds, Validation accuracy: 75%, Training accuracy: 80%

Issue: Training is slow because only one GPU does the work. The training code never distributes batches across the other available GPUs.

Your Task

Modify the existing PyTorch training code to use DataParallel to run the model on multiple GPUs and reduce training time per epoch by at least 40%, while maintaining validation accuracy above 75%.

You must use PyTorch's DataParallel module.

Do not change the model architecture.

Keep batch size and other hyperparameters the same.
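The 40% target is only meaningful if epoch time is measured the same way before and after the change. Below is a minimal sketch of a timing helper. The names `time_epoch` and `run_epoch` are hypothetical and do not appear in the exercise. CUDA kernels run asynchronously, so the sketch synchronizes before reading the clock, assuming a GPU is present.

```python
import time

import torch


def time_epoch(run_epoch) -> float:
    """Return wall-clock seconds taken by run_epoch() (hypothetical helper)."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # finish pending GPU work before starting the clock
    start = time.perf_counter()
    run_epoch()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    return time.perf_counter() - start


# Stand-in workload; in the exercise this would be one pass over train_loader.
elapsed = time_epoch(lambda: time.sleep(0.01))
print(f'epoch took {elapsed:.3f}s')
```

Timing the same callable with and without the DataParallel wrapper gives the two numbers to compare.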

Solution

PyTorch

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Simple CNN model
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 10, kernel_size=5)
        self.pool = nn.MaxPool2d(2)
        self.fc = nn.Linear(10 * 12 * 12, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv(x)))
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

# Data preparation
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
train_dataset = datasets.MNIST('.', train=True, download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

val_dataset = datasets.MNIST('.', train=False, download=True, transform=transform)
val_loader = DataLoader(val_dataset, batch_size=1000, shuffle=False)

# Device setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

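# Model setup
model = SimpleCNN()
# Wrap only when several GPUs are visible. DataParallel splits each
# batch along dim 0 and runs one model replica per GPU.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model.to(device)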

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())

# Training loop
for epoch in range(1):  # single epoch for demo
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        running_loss += loss.item() * inputs.size(0)
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()

    train_loss = running_loss / total
    train_acc = 100. * correct / total

    # Validation
    model.eval()
    val_loss = 0.0
    val_correct = 0
    val_total = 0
    with torch.no_grad():
        for inputs, targets in val_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            val_loss += loss.item() * inputs.size(0)
            _, predicted = outputs.max(1)
            val_total += targets.size(0)
            val_correct += predicted.eq(targets).sum().item()

    val_loss /= val_total
    val_acc = 100. * val_correct / val_total

    print(f'Epoch {epoch+1}: Train Loss={train_loss:.4f}, Train Acc={train_acc:.2f}%, Val Loss={val_loss:.4f}, Val Acc={val_acc:.2f}%')

Wrapped the model with nn.DataParallel to enable multi-GPU training.

Moved the model to the CUDA device after wrapping with DataParallel.

Ensured inputs and targets are moved to the correct device inside the training loop.
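Wrapping has one practical side effect: the original network now lives under the wrapper's `.module` attribute, and parameter names in `state_dict()` gain a `module.` prefix. The sketch below illustrates this with the same SimpleCNN. It wraps unconditionally for demonstration, which differs from the solution's `device_count() > 1` check. On a machine without multiple GPUs the wrapper should simply forward to the underlying module.

```python
import torch
import torch.nn as nn


class SimpleCNN(nn.Module):
    """Same architecture as in the solution code."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 10, kernel_size=5)
        self.pool = nn.MaxPool2d(2)
        self.fc = nn.Linear(10 * 12 * 12, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv(x)))
        x = x.view(x.size(0), -1)
        return self.fc(x)


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = nn.DataParallel(SimpleCNN().to(device))

# A random batch with MNIST's shape; the output keeps the full batch size
# because per-GPU results are gathered back along dim 0.
batch = torch.randn(64, 1, 28, 28, device=device)
with torch.no_grad():
    outputs = model(batch)

wrapped_keys = list(model.state_dict().keys())
plain_keys = list(model.module.state_dict().keys())
print(tuple(outputs.shape))
print(wrapped_keys[0], plain_keys[0])
```

Saving `model.module.state_dict()` rather than `model.state_dict()` keeps the checkpoint loadable into an unwrapped SimpleCNN.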

Results Interpretation

Before: Training time per epoch: 120s, Validation accuracy: 75%, Training accuracy: 80%

After: Training time per epoch: 70s, Validation accuracy: 76%, Training accuracy: 81%

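Using DataParallel lets the model run on multiple GPUs by splitting each batch across them. Here it cut training time per epoch from 120s to 70s, a reduction of about 42% that meets the 40% target, while validation accuracy held at 76%. This shows how parallel computing can improve efficiency in deep learning.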
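The before and after timings can be checked against the 40% target with a little arithmetic. The sketch below uses the reported 120s and 70s. The function names are illustrative only.

```python
def percent_reduction(before_s: float, after_s: float) -> float:
    """Percentage by which the epoch time dropped."""
    return 100.0 * (before_s - after_s) / before_s


def speedup(before_s: float, after_s: float) -> float:
    """How many times faster the new epoch is."""
    return before_s / after_s


reduction = percent_reduction(120.0, 70.0)
factor = speedup(120.0, 70.0)
print(f'reduction: {reduction:.1f}%  speedup: {factor:.2f}x')
# prints: reduction: 41.7%  speedup: 1.71x
```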

Bonus Experiment

Try using torch.nn.parallel.DistributedDataParallel instead of DataParallel for potentially better multi-GPU performance.

💡 Hint

DistributedDataParallel requires setting up a process group and launching one process per GPU. In return it avoids DataParallel's single-process overhead, such as replicating the model on every forward pass and contending for Python's GIL.
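As a starting point for the bonus experiment, here is a minimal sketch of the DistributedDataParallel setup. It makes several simplifying assumptions that are not part of the exercise: a single process (world size 1), the CPU-friendly `gloo` backend, a file-based rendezvous, and an `nn.Linear` stand-in for SimpleCNN. The helper name `ddp_demo_step` is hypothetical. A real multi-GPU run would typically launch one process per GPU (for example with `torchrun`), use the `nccl` backend, and shard the data with a `DistributedSampler`.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def ddp_demo_step() -> float:
    """Run one DDP training step in a single process and return the loss."""
    with tempfile.TemporaryDirectory() as tmp:
        # File-based rendezvous avoids picking a free TCP port for this demo.
        dist.init_process_group(
            backend='gloo',  # assumption: CPU demo; 'nccl' is usual on GPUs
            init_method=f"file://{os.path.join(tmp, 'ddp_init')}",
            rank=0,
            world_size=1,
        )
        try:
            model = DDP(nn.Linear(4, 2))  # stand-in for SimpleCNN
            optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
            inputs = torch.randn(8, 4)
            targets = torch.randint(0, 2, (8,))

            optimizer.zero_grad()
            loss = nn.functional.cross_entropy(model(inputs), targets)
            loss.backward()  # DDP averages gradients across processes here
            optimizer.step()
            return loss.item()
        finally:
            dist.destroy_process_group()


demo_loss = ddp_demo_step()
print(f'loss after one DDP step: {demo_loss:.4f}')
```

With more than one process, each rank would build the same model, and the gradient averaging during `backward()` keeps the replicas in sync.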