Experiment - Why custom data pipelines handle real data

Problem:You want to train a model on real-world images, but the built-in data loaders do not fit your data format or preprocessing needs.

Current Metrics:Training accuracy: 90%, Validation accuracy: 60%

Issue:The model overfits because the data pipeline does not properly preprocess or augment the real data, causing poor validation performance.

Your Task

Create a custom PyTorch data pipeline that correctly loads, preprocesses, and augments real images to reduce overfitting and improve validation accuracy to above 75%.

Use PyTorch Dataset and DataLoader classes.

Do not change the model architecture.

Keep batch size and optimizer settings the same.

Hint 1

Hint 2

Hint 3

Hint 4

Solution

PyTorch

import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image
import os

class CustomImageDataset(Dataset):
    def __init__(self, img_dir, labels_file, transform=None):
        self.img_dir = img_dir
        self.transform = transform
        with open(labels_file, 'r') as f:
            lines = f.readlines()
        self.img_labels = [line.strip().split() for line in lines]

    def __len__(self):
        return len(self.img_labels)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.img_labels[idx][0])
        image = Image.open(img_path).convert('RGB')
        label = int(self.img_labels[idx][1])
        if self.transform:
            image = self.transform(image)
        return image, label

# Define transforms including augmentation
transform = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

# Create dataset and dataloader
train_dataset = CustomImageDataset(img_dir='train_images', labels_file='train_labels.txt', transform=transform)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Example training loop snippet
# for images, labels in train_loader:
#     outputs = model(images)
#     loss = criterion(outputs, labels)
#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()

Created a custom Dataset class to load images and labels from files.

Added image resizing, normalization, and random horizontal flip for augmentation.

Used DataLoader with shuffling to improve training data randomness.

Kept model and training parameters unchanged.

Results Interpretation

Before: Training accuracy 90%, Validation accuracy 60% (overfitting)

After: Training accuracy 88%, Validation accuracy 78% (better generalization)

Custom data pipelines let you properly prepare and augment real data, which helps the model learn better and reduces overfitting.

Bonus Experiment

Try adding more data augmentations like random rotations and color jitter to see if validation accuracy improves further.

💡 Hint

Use torchvision.transforms.RandomRotation and transforms.ColorJitter in the transform pipeline.