nn.DataParallel replicates the model on each GPU, splits the input batch along dimension 0, runs the replicas in parallel, and gathers the outputs back on the primary device. This speeds up training by spreading the per-batch work across all available GPUs.
import torch
import torch.nn as nn

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 1)

    def forward(self, x):
        return self.linear(x)

model = SimpleModel()
model = nn.DataParallel(model)
input_tensor = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
output = model(input_tensor)
print(output.shape)  # torch.Size([2, 1])
The input has batch size 2 and 2 features. The linear layer maps each 2-feature row to a single output, so the result has one value per sample. DataParallel splits the batch across GPUs but concatenates the per-GPU outputs along dimension 0, so the final output shape is still (2, 1).
DataParallel automatically splits the batch across GPUs, so you can keep the original batch size: it is divided as evenly as possible among the devices (for example, a batch of 3 on 2 GPUs becomes chunks of 2 and 1).
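The splitting behavior can be sketched on the CPU with torch.chunk, which splits a tensor along a dimension the same way DataParallel's scatter step divides the batch (the GPU count here is a hypothetical value for illustration):

```python
import torch

# Illustrative sketch: DataParallel scatters the input along dim 0 into
# roughly equal chunks, one per GPU, similar to torch.chunk.
batch = torch.randn(3, 10)   # batch size 3
num_gpus = 2                 # hypothetical GPU count
chunks = batch.chunk(num_gpus, dim=0)
print([c.shape[0] for c in chunks])  # [2, 1]: replica 0 gets 2 samples, replica 1 gets 1
```

Note the split is only even when the batch size is divisible by the number of GPUs; otherwise the last chunk is smaller.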
import torch
import torch.nn as nn

model = nn.Linear(10, 5)                  # parameters live on CPU
model = nn.DataParallel(model)
input_tensor = torch.randn(3, 10).cuda()  # input lives on GPU
output = model(input_tensor)              # RuntimeError: parameters and input on different devices
print(output)
The model's parameters are created on CPU by default, but DataParallel expects the wrapped module to already be on the primary GPU (cuda:0) before wrapping. Because the input is on GPU while the parameters are on CPU, the forward pass raises a RuntimeError about the device mismatch.
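A fixed version moves the model to the GPU before wrapping and keeps the inputs on the same device. The sketch below guards on CUDA availability so it also runs on a CPU-only machine:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 5).to(device)       # move model to the device *before* wrapping
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)        # only useful with multiple GPUs

input_tensor = torch.randn(3, 10).to(device)  # input on the same device as the model
output = model(input_tensor)
print(output.shape)  # torch.Size([3, 5])
```

The key point is the ordering: `.to(device)` first, then `nn.DataParallel(...)`, and every input tensor sent to the same device.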
DistributedDataParallel (DDP) is recommended for multi-GPU training: it runs one process per GPU instead of one multi-threaded process, so it avoids Python's GIL bottleneck, and it overlaps gradient all-reduce with the backward pass rather than gathering everything on a single primary device as DataParallel does.
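A minimal single-process sketch of the DDP setup is below, using the gloo backend on CPU so it runs anywhere; in real multi-GPU training you would launch one process per GPU (e.g. with torchrun), and the address/port values here are arbitrary placeholders:

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process sketch (gloo backend, CPU) just to show the wrapping;
# real training launches one process per GPU, e.g. via torchrun.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # placeholder rendezvous address
os.environ.setdefault("MASTER_PORT", "29500")      # placeholder port
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Linear(10, 5)
ddp_model = DDP(model)  # gradients are all-reduced across processes during backward()

out = ddp_model(torch.randn(3, 10))
out.sum().backward()
print(out.shape)  # torch.Size([3, 5])

dist.destroy_process_group()
```

Unlike DataParallel, DDP synchronizes gradients between processes during `backward()`, so each process keeps a full model replica and optimizer step, and no single device has to gather all outputs.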