PyTorch · How-To · Beginner · 3 min read

How to Use DataLoader in PyTorch: Syntax and Example

Use torch.utils.data.DataLoader to load data in batches, shuffle it, and enable parallel loading in PyTorch. It wraps a dataset and provides an iterable over the data with options like batch_size, shuffle, and num_workers.

Syntax

The DataLoader is created by passing a dataset and specifying parameters like batch size and shuffling. Key parameters include:

  • dataset: The dataset object to load data from.
  • batch_size: Number of samples per batch.
  • shuffle: Whether to shuffle data each epoch.
  • num_workers: Number of subprocesses for data loading.
python
from torch.utils.data import DataLoader

dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)

Example

This example shows how to create a simple dataset and use DataLoader to iterate over batches of data with shuffling enabled.

python
import torch
from torch.utils.data import Dataset, DataLoader

# Define a simple dataset
class MyDataset(Dataset):
    def __init__(self):
        self.data = torch.arange(10)  # Data from 0 to 9
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx]

# Create dataset and dataloader
my_dataset = MyDataset()
dataloader = DataLoader(my_dataset, batch_size=3, shuffle=True)

# Iterate over batches
for batch in dataloader:
    print(batch)
Output (the exact order varies between runs because shuffle=True)
tensor([3, 1, 0])
tensor([7, 8, 6])
tensor([9, 4, 5])
tensor([2])
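As a quick sanity check on the shuffling behavior, the sketch below (using a small range dataset like the one above) shows that each epoch still visits every sample exactly once, just in a different order:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class RangeDataset(Dataset):
    """Toy dataset holding the integers 0..n-1."""
    def __init__(self, n):
        self.data = torch.arange(n)
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx]

loader = DataLoader(RangeDataset(6), batch_size=2, shuffle=True)

for epoch in range(2):
    # Collect the order in which samples appear this epoch
    order = [x.item() for batch in loader for x in batch]
    print(f"epoch {epoch}: {order}")  # order differs between epochs and runs
```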

Common Pitfalls

Common mistakes when using DataLoader include:

  • Not setting shuffle=True when you want random batches, which can cause your model to see data in the same order every epoch.
  • Leaving num_workers=0 loads data in the main process, which can become a bottleneck; raising it enables parallel loading, but multiprocessing can cause issues on some systems, so test it on yours.
  • Forgetting to unpack batches properly when the dataset returns multiple items per sample (such as an input and a label).
python
from torch.utils.data import DataLoader

# Often suboptimal for training: fixed data order, single-process loading
loader_wrong = DataLoader(my_dataset, batch_size=4, shuffle=False, num_workers=0)

# Right: shuffle data and use multiple workers
loader_right = DataLoader(my_dataset, batch_size=4, shuffle=True, num_workers=2)
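To make the unpacking pitfall concrete, here is a minimal sketch using torch.utils.data.TensorDataset (the tensor shapes are illustrative): when each sample is an (input, label) pair, each batch is a pair of batched tensors and should be unpacked accordingly.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

inputs = torch.randn(8, 2)  # 8 samples, 2 features each
labels = torch.arange(8)    # one label per sample

pair_loader = DataLoader(TensorDataset(inputs, labels), batch_size=4)

# Each batch is an (inputs, labels) pair -- unpack it rather than
# treating the whole batch as a single tensor
for x, y in pair_loader:
    print(x.shape, y.shape)  # torch.Size([4, 2]) torch.Size([4])
```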

Quick Reference

Parameter   | Description                         | Default
dataset     | Dataset to load data from           | None (required)
batch_size  | Number of samples per batch         | 1
shuffle     | Shuffle data every epoch            | False
num_workers | Number of subprocesses for loading  | 0
drop_last   | Drop last incomplete batch          | False
pin_memory  | Copy tensors to CUDA pinned memory  | False
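A short sketch of how drop_last changes batching, using a 10-sample TensorDataset for illustration: with batch_size=3, the default keeps a final partial batch of 1, while drop_last=True discards it.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

data = TensorDataset(torch.arange(10))  # 10 samples

keep = DataLoader(data, batch_size=3)                  # batches of 3, 3, 3, 1
drop = DataLoader(data, batch_size=3, drop_last=True)  # final partial batch discarded

# len() on a DataLoader gives the number of batches
print(len(keep), len(drop))  # 4 3
```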

Key Takeaways

  • Use DataLoader to efficiently load data in batches with optional shuffling and parallelism.
  • Set shuffle=True during training to improve model generalization by randomizing data order.
  • Increase num_workers to speed up data loading, but test for system compatibility.
  • Remember to unpack batches correctly if your dataset returns multiple items, such as inputs and labels.
  • Use drop_last=True to avoid a smaller last batch when batch-size consistency is needed.