0
0
PyTorchml~20 mins

Why custom data pipelines handle real data in PyTorch - Experiment to Prove It

Choose your learning style9 modes available
Experiment - Why custom data pipelines handle real data
Problem:You want to train a model on real-world images, but the built-in data loaders do not fit your data format or preprocessing needs.
Current Metrics:Training accuracy: 90%, Validation accuracy: 60%
Issue:The model overfits because the data pipeline does not properly preprocess or augment the real data, causing poor validation performance.
Your Task
Create a custom PyTorch data pipeline that correctly loads, preprocesses, and augments real images to reduce overfitting and improve validation accuracy to above 75%.
Use PyTorch Dataset and DataLoader classes.
Do not change the model architecture.
Keep batch size and optimizer settings the same.
Hint 1
Hint 2
Hint 3
Hint 4
Solution
PyTorch
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image
import os

class CustomImageDataset(Dataset):
    def __init__(self, img_dir, labels_file, transform=None):
        self.img_dir = img_dir
        self.transform = transform
        with open(labels_file, 'r') as f:
            lines = f.readlines()
        self.img_labels = [line.strip().split() for line in lines]

    def __len__(self):
        return len(self.img_labels)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.img_labels[idx][0])
        image = Image.open(img_path).convert('RGB')
        label = int(self.img_labels[idx][1])
        if self.transform:
            image = self.transform(image)
        return image, label

# Define transforms including augmentation
transform = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

# Create dataset and dataloader
train_dataset = CustomImageDataset(img_dir='train_images', labels_file='train_labels.txt', transform=transform)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Example training loop snippet
# for images, labels in train_loader:
#     outputs = model(images)
#     loss = criterion(outputs, labels)
#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()
Created a custom Dataset class to load images and labels from files.
Added image resizing, normalization, and random horizontal flip for augmentation.
Used DataLoader with shuffling to improve training data randomness.
Kept model and training parameters unchanged.
Results Interpretation

Before: Training accuracy 90%, Validation accuracy 60% (overfitting)

After: Training accuracy 88%, Validation accuracy 78% (better generalization)

Custom data pipelines let you properly prepare and augment real data, which helps the model learn better and reduces overfitting.
Bonus Experiment
Try adding more data augmentations like random rotations and color jitter to see if validation accuracy improves further.
💡 Hint
Use torchvision.transforms.RandomRotation and transforms.ColorJitter in the transform pipeline.