Batch size and shuffling in PyTorch
Introduction
Use a DataLoader with batching and shuffling when:
- Training a model on a large dataset that can't fit in memory all at once.
- Speeding up training by processing multiple samples at once.
- Preventing the model from learning the order of the data by mixing it each epoch.
- Balancing memory use and training speed by choosing an appropriate batch size.
- Improving model generalization by reshuffling the data every epoch.
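The use cases above can be sketched with a minimal training loop, assuming a hypothetical toy regression dataset and a one-layer linear model: the model's weights are updated once per batch, not once per sample.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy regression data: 100 samples, 1 feature, target y = 3x + 1
x = torch.randn(100, 1)
y = 3 * x + 1
dataset = TensorDataset(x, y)

# Shuffle each epoch; the model updates once per batch of 16 samples
loader = DataLoader(dataset, batch_size=16, shuffle=True)
model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

updates = 0
for epoch in range(2):
    for features, targets in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(features), targets)
        loss.backward()
        optimizer.step()
        updates += 1  # one weight update per batch

# 2 epochs x ceil(100 / 16) = 2 x 7 batches
print(updates)  # 14
```

With 100 samples and a batch size of 16, each epoch yields 7 batches (the last one holds only 4 samples), so two epochs produce 14 weight updates.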
Syntax
PyTorch
DataLoader(dataset, batch_size=number, shuffle=True_or_False)
batch_size sets how many samples are grouped into each batch.
shuffle=True reshuffles the data order at the start of every epoch; shuffle=False keeps the dataset's original order.
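To see the reshuffling in action, here is a small sketch with a hypothetical 8-sample dataset: iterating the same DataLoader twice (two epochs) draws a fresh permutation each time, while every epoch still visits all samples exactly once.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset of 8 samples to illustrate reshuffling
x = torch.arange(8).unsqueeze(1).float()
y = torch.zeros(8)
dataset = TensorDataset(x, y)

# A fixed seed makes this demonstration repeatable across runs
torch.manual_seed(0)
loader = DataLoader(dataset, batch_size=4, shuffle=True)

# Each full pass over the loader is one epoch; the order is redrawn each time
epoch_orders = []
for epoch in range(2):
    order = []
    for features, labels in loader:
        order.extend(features.squeeze(1).tolist())
    epoch_orders.append(order)
    print(f"Epoch {epoch + 1} order: {order}")
```

The two printed orders will almost always differ, but each contains every sample 0..7 exactly once.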
Examples
Loads data in groups of 32 and mixes the order each epoch.
PyTorch
DataLoader(my_dataset, batch_size=32, shuffle=True)
Loads data in groups of 64 without changing order.
PyTorch
DataLoader(my_dataset, batch_size=64, shuffle=False)
Loads one sample at a time, shuffling data each epoch.
PyTorch
DataLoader(my_dataset, batch_size=1, shuffle=True)
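One detail the examples above skip: when the dataset size is not a multiple of batch_size, the final batch is simply smaller, and the number of batches is ceil(dataset size / batch_size). A quick sketch, using an assumed 10-sample toy dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical dataset of 10 samples
dataset = TensorDataset(torch.arange(10).unsqueeze(1).float(), torch.zeros(10))

loader = DataLoader(dataset, batch_size=3, shuffle=False)

# Number of batches is ceil(10 / 3) = 4; the last batch has only 1 sample
print(len(loader))  # 4
batch_sizes = [features.shape[0] for features, labels in loader]
print(batch_sizes)  # [3, 3, 3, 1]
```

If a training step cannot handle a smaller final batch, DataLoader's drop_last=True option discards it instead.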
Sample Model
This code shows how batch size groups samples and how shuffling changes the order of data in batches.
PyTorch
import torch
from torch.utils.data import DataLoader, TensorDataset

# Create a simple dataset of 10 samples
x = torch.arange(10).unsqueeze(1).float()  # Features: 0 to 9
# Dummy labels (just zeros)
y = torch.zeros(10)

# Create dataset
dataset = TensorDataset(x, y)

# DataLoader with batch size 3 and shuffling enabled
loader = DataLoader(dataset, batch_size=3, shuffle=True)

print("Batches with shuffle=True:")
for batch_idx, (features, labels) in enumerate(loader):
    print(f"Batch {batch_idx + 1}: features = {features.squeeze(1).tolist()}")

# DataLoader with batch size 3 and shuffling disabled
loader_no_shuffle = DataLoader(dataset, batch_size=3, shuffle=False)

print("\nBatches with shuffle=False:")
for batch_idx, (features, labels) in enumerate(loader_no_shuffle):
    print(f"Batch {batch_idx + 1}: features = {features.squeeze(1).tolist()}")
Important Notes
Choosing a batch size that is a power of 2 (like 32, 64) can be faster on some hardware.
Shuffling is important to prevent the model from learning the order of data, which can cause poor generalization.
Very large batch sizes may exceed available memory, and can also hurt generalization compared with smaller batches.
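When shuffling is enabled but you still need reproducible experiments, a seeded torch.Generator can be passed to the DataLoader. A minimal sketch, assuming a hypothetical 6-sample toy dataset and a helper function introduced here for illustration:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset of 6 samples
dataset = TensorDataset(torch.arange(6).unsqueeze(1).float(), torch.zeros(6))

def first_epoch_order(seed):
    # A seeded generator makes the shuffle order reproducible across runs
    g = torch.Generator()
    g.manual_seed(seed)
    loader = DataLoader(dataset, batch_size=2, shuffle=True, generator=g)
    return [v for features, _ in loader for v in features.squeeze(1).tolist()]

# Same seed -> identical shuffled order
order_a = first_epoch_order(42)
order_b = first_epoch_order(42)
print(order_a == order_b)  # True
```

Different seeds would give different (but still reproducible) orders.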
Summary
Batch size controls how many samples the model sees before updating its weights.
Shuffling mixes data order to help the model learn better and avoid bias.
Use DataLoader in PyTorch to set batch size and shuffle easily.