How to Use DataLoader in PyTorch: Syntax and Example
Use torch.utils.data.DataLoader to load data in batches, shuffle it, and enable parallel loading in PyTorch. It wraps a dataset and provides an iterable over the data with options like batch_size, shuffle, and num_workers.
Syntax
The DataLoader is created by passing a dataset and specifying parameters such as batch size and shuffling. Key parameters include:
- dataset: The dataset object to load data from.
- batch_size: Number of samples per batch.
- shuffle: Whether to shuffle data each epoch.
- num_workers: Number of subprocesses for data loading.
```python
from torch.utils.data import DataLoader

dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)
```
Example
This example shows how to create a simple dataset and use DataLoader to iterate over batches of data with shuffling enabled.
```python
import torch
from torch.utils.data import Dataset, DataLoader

# Define a simple dataset
class MyDataset(Dataset):
    def __init__(self):
        self.data = torch.arange(10)  # Data from 0 to 9

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

# Create dataset and dataloader
my_dataset = MyDataset()
dataloader = DataLoader(my_dataset, batch_size=3, shuffle=True)

# Iterate over batches
for batch in dataloader:
    print(batch)
```
Output (the exact order varies between runs because shuffle=True):
tensor([3, 1, 0])
tensor([7, 8, 6])
tensor([9, 4, 5])
tensor([2])
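In practice, each sample usually consists of an input and a label, and batches then arrive as tuples of tensors that you can unpack directly in the loop. Below is a minimal sketch using torch.utils.data.TensorDataset; the feature and label values are made up purely for illustration:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical data: 10 samples with 2 features each, plus integer labels
features = torch.arange(20, dtype=torch.float32).reshape(10, 2)
labels = torch.arange(10)

# TensorDataset pairs the tensors sample-wise; __getitem__ returns a tuple
dataset = TensorDataset(features, labels)
loader = DataLoader(dataset, batch_size=4)  # shuffle=False keeps batches deterministic

# Each batch is a (features, labels) tuple of stacked tensors;
# the final batch holds only 2 samples because 10 % 4 == 2
for x, y in loader:
    print(x.shape, y.shape)
```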
Common Pitfalls
Common mistakes when using DataLoader include:
- Not setting shuffle=True when you want random batches, which can cause your model to see data in the same order every epoch.
- Leaving num_workers=0, which loads data in the main process and can become a bottleneck; increasing it speeds up loading but may cause issues on some platforms (for example, multiprocessing quirks on Windows and macOS).
- Forgetting to unpack batches properly when the dataset returns multiple items (like an input and a label).
```python
from torch.utils.data import DataLoader

# Wrong: no shuffle, single-process loading
loader_wrong = DataLoader(my_dataset, batch_size=4, shuffle=False, num_workers=0)

# Right: shuffle data and use multiple workers
loader_right = DataLoader(my_dataset, batch_size=4, shuffle=True, num_workers=2)
```
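The unpacking pitfall also applies to datasets that return dictionaries: the default collate function batches each field separately, so you index the batch by key rather than unpacking a tuple. A sketch with a hypothetical dict-returning dataset (the field names "input" and "target" are illustrative):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class DictDataset(Dataset):
    """Hypothetical dataset whose samples are dicts of tensors."""
    def __init__(self):
        self.inputs = torch.arange(8, dtype=torch.float32)
        self.targets = self.inputs * 2

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return {"input": self.inputs[idx], "target": self.targets[idx]}

loader = DataLoader(DictDataset(), batch_size=4)

# The default collate function stacks each key across the batch,
# so every batch is a dict mapping keys to batched tensors
for batch in loader:
    print(batch["input"].shape, batch["target"].shape)
```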
Quick Reference
| Parameter | Description | Default |
|---|---|---|
| dataset | Dataset to load data from | None (required) |
| batch_size | Number of samples per batch | 1 |
| shuffle | Shuffle data every epoch | False |
| num_workers | Number of subprocesses for loading | 0 |
| drop_last | Drop last incomplete batch | False |
| pin_memory | Copy tensors to CUDA pinned memory | False |
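To illustrate drop_last from the table above: with 10 samples and batch_size=3, the final batch normally holds only one sample; setting drop_last=True discards it so every batch has the same size. A small sketch (the dataset here is just a tensor of integers, which works because tensors support len() and indexing):

```python
import torch
from torch.utils.data import DataLoader

data = torch.arange(10)  # 10 samples; batch_size=3 leaves a remainder of 1

keep_last = DataLoader(data, batch_size=3)                  # default drop_last=False
drop_last = DataLoader(data, batch_size=3, drop_last=True)  # discard the short batch

print(len(keep_last))  # 4 batches: sizes 3, 3, 3, 1
print(len(drop_last))  # 3 batches of exactly 3
```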
Key Takeaways
- Use DataLoader to efficiently load data in batches with optional shuffling and parallelism.
- Set shuffle=True during training to improve model generalization by randomizing data order.
- Increase num_workers to speed up data loading, but test for system compatibility.
- Remember to unpack batches correctly if your dataset returns multiple items, such as inputs and labels.
- Use drop_last=True to avoid a smaller final batch when batch-size consistency is needed.