How to Use DataLoader in PyTorch: Syntax and Example
Use torch.utils.data.DataLoader to load data in batches, shuffle it, and enable parallel loading in PyTorch. It wraps a dataset and provides an iterable over the data with options like batch_size, shuffle, and num_workers.
Syntax
The DataLoader is created by passing a dataset and specifying parameters such as batch size and shuffling. Key parameters include:
- dataset: The dataset object to load data from.
- batch_size: Number of samples per batch.
- shuffle: Whether to shuffle data each epoch.
- num_workers: Number of subprocesses for data loading.
```python
from torch.utils.data import DataLoader

dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)
```
Example
This example shows how to create a simple dataset and use DataLoader to iterate over batches of data with shuffling enabled.
```python
import torch
from torch.utils.data import Dataset, DataLoader

# Define a simple dataset
class MyDataset(Dataset):
    def __init__(self):
        self.data = torch.arange(10)  # Data from 0 to 9

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

# Create dataset and dataloader
my_dataset = MyDataset()
dataloader = DataLoader(my_dataset, batch_size=3, shuffle=True)

# Iterate over batches
for batch in dataloader:
    print(batch)
```
Output (the exact order varies between runs because shuffle=True):
tensor([3, 1, 0])
tensor([7, 8, 6])
tensor([9, 4, 5])
tensor([2])
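In practice, each sample usually consists of an input and a label, and batches then arrive as tuples of tensors that you can unpack directly in the loop. Below is a minimal sketch using torch.utils.data.TensorDataset; the feature and label values are made up purely for illustration:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical data: 10 samples with 2 features each, plus integer labels
features = torch.arange(20, dtype=torch.float32).reshape(10, 2)
labels = torch.arange(10)

# TensorDataset pairs the tensors sample-wise; __getitem__ returns a tuple
dataset = TensorDataset(features, labels)
loader = DataLoader(dataset, batch_size=4)  # shuffle=False keeps batches deterministic

# Each batch is a (features, labels) tuple of stacked tensors;
# the final batch holds only 2 samples because 10 % 4 == 2
for x, y in loader:
    print(x.shape, y.shape)
```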
Common Pitfalls
Common mistakes when using DataLoader include:
- Not setting shuffle=True when you want random batches, which can cause your model to see data in the same order every epoch.
- Leaving num_workers=0, which loads data in the main process and can become a bottleneck; increasing it speeds up loading but may cause issues on some platforms (for example, multiprocessing quirks on Windows and macOS).
- Forgetting to unpack batches properly when the dataset returns multiple items (like an input and a label).
```python
from torch.utils.data import DataLoader

# Wrong: no shuffle, single-process loading
loader_wrong = DataLoader(my_dataset, batch_size=4, shuffle=False, num_workers=0)

# Right: shuffle data and use multiple workers
loader_right = DataLoader(my_dataset, batch_size=4, shuffle=True, num_workers=2)
```
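The unpacking pitfall also applies to datasets that return dictionaries: the default collate function batches each field separately, so you index the batch by key rather than unpacking a tuple. A sketch with a hypothetical dict-returning dataset (the field names "input" and "target" are illustrative):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class DictDataset(Dataset):
    """Hypothetical dataset whose samples are dicts of tensors."""
    def __init__(self):
        self.inputs = torch.arange(8, dtype=torch.float32)
        self.targets = self.inputs * 2

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return {"input": self.inputs[idx], "target": self.targets[idx]}

loader = DataLoader(DictDataset(), batch_size=4)

# The default collate function stacks each key across the batch,
# so every batch is a dict mapping keys to batched tensors
for batch in loader:
    print(batch["input"].shape, batch["target"].shape)
```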
Quick Reference
| Parameter | Description | Default |
|---|---|---|
| dataset | Dataset to load data from | None (required) |
| batch_size | Number of samples per batch | 1 |
| shuffle | Shuffle data every epoch | False |
| num_workers | Number of subprocesses for loading | 0 |
| drop_last | Drop last incomplete batch | False |
| pin_memory | Copy tensors to CUDA pinned memory | False |
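To illustrate drop_last from the table above: with 10 samples and batch_size=3, the final batch normally holds only one sample; setting drop_last=True discards it so every batch has the same size. A small sketch (the dataset here is just a tensor of integers, which works because tensors support len() and indexing):

```python
import torch
from torch.utils.data import DataLoader

data = torch.arange(10)  # 10 samples; batch_size=3 leaves a remainder of 1

keep_last = DataLoader(data, batch_size=3)                  # default drop_last=False
drop_last = DataLoader(data, batch_size=3, drop_last=True)  # discard the short batch

print(len(keep_last))  # 4 batches: sizes 3, 3, 3, 1
print(len(drop_last))  # 3 batches of exactly 3
```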
Key Takeaways
- Use DataLoader to efficiently load data in batches with optional shuffling and parallelism.
- Set shuffle=True during training to improve model generalization by randomizing data order.
- Increase num_workers to speed up data loading, but test for system compatibility.
- Remember to unpack batches correctly if your dataset returns multiple items, such as inputs and labels.
- Use drop_last=True to avoid a smaller final batch when batch-size consistency is needed.