
Why Use the Dataset Class (Custom Datasets) in PyTorch? - Purpose & Use Cases

The Big Idea

What if you could stop wasting hours on loading data and start training your model faster and smarter?

The Scenario

Imagine you have hundreds of images stored in different folders, each representing a category, and you need to load them one by one to train a model.

You try to open each file manually, read it, and label it correctly before feeding it to your program.

The Problem

This manual approach is slow and tiring because you have to write repetitive code for loading and labeling each file.

It's easy to make mistakes like mixing up labels or forgetting files, and updating your code for new data becomes a headache.

The Solution

The Dataset class in PyTorch gives you one standard place to define how your data is loaded and organized.

You write the code once: __len__ reports how many items there are, and __getitem__ returns one item together with its label. A DataLoader can then batch, shuffle, and load those items in parallel for you.

Before vs After
Before
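# open_image, get_label, and files are placeholders for your own code.
images = []
labels = []
for file in files:
    img = open_image(file)  # every image is read into memory up front
    label = get_label(file)
    images.append(img)
    labels.append(label)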
After
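from torch.utils.data import Dataset

# open_image and get_label are placeholders for your own loading code.
class CustomDataset(Dataset):
    def __init__(self, files):
        self.files = files  # store only the paths; nothing is read yet

    def __len__(self):
        return len(self.files)  # number of items in the dataset

    def __getitem__(self, idx):
        # Load a single item on demand, together with its label.
        img = open_image(self.files[idx])
        label = get_label(self.files[idx])
        return img, label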
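
To make the batching step concrete, here is a minimal sketch of a dataset wrapped in a DataLoader. The SquaresDataset class and its synthetic values are assumptions made for illustration; they stand in for real image files so the snippet runs on its own.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: item i is the pair (i, i squared)."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        x = torch.tensor([float(idx)])
        y = torch.tensor([float(idx) ** 2])
        return x, y


dataset = SquaresDataset(10)
# DataLoader relies on __len__ and __getitem__ to assemble batches.
loader = DataLoader(dataset, batch_size=4, shuffle=False)

# Collect the shape of the input tensor in each batch.
batch_shapes = [tuple(x.shape) for x, _ in loader]
print(batch_shapes)
```

With 10 items and batch_size=4, the loader yields three batches, and the last one holds the remaining 2 items. Setting shuffle=True would reorder the items each epoch without any change to the dataset class.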
What It Enables

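It makes loading, transforming, and managing large and complex datasets simpler and less error-prone, because the loading logic lives in one place and each item is read only when it is requested. That leaves you free to focus on building your model.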
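
As a rough illustration of the transforming part, the sketch below passes an optional transform into the dataset and applies it each time an item is fetched. The TransformedDataset name, the sample values, and the scaling function are all hypothetical choices for this example.

```python
import torch
from torch.utils.data import Dataset


class TransformedDataset(Dataset):
    """Hypothetical dataset that applies a transform to each item on access."""

    def __init__(self, values, labels, transform=None):
        self.values = values
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.values)

    def __getitem__(self, idx):
        x = torch.tensor(self.values[idx], dtype=torch.float32)
        if self.transform is not None:
            x = self.transform(x)  # applied lazily, one item at a time
        return x, self.labels[idx]


# Scale raw 0-255 pixel-like values into the range [0, 1].
scaled = TransformedDataset(
    values=[[0.0, 255.0], [51.0, 102.0]],
    labels=[0, 1],
    transform=lambda t: t / 255.0,
)
first_x, first_y = scaled[0]
print(first_x, first_y)
```

Because the transform is an argument, the same dataset class can be reused with different preprocessing for training and evaluation.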

Real Life Example

For example, when training a model to recognize different types of animals from thousands of photos stored in folders, a custom Dataset class can automatically load and label each photo correctly without manual effort.
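
One possible way to sketch this folder-per-category setup is shown below. The FolderDataset class, its loader argument, and the throwaway demo directory are assumptions for illustration; the default loader just reads raw bytes, where real code would decode the image.

```python
import tempfile
from pathlib import Path

from torch.utils.data import Dataset


class FolderDataset(Dataset):
    """Hypothetical dataset: each subfolder of root is one class."""

    def __init__(self, root, loader=None):
        root = Path(root)
        # Sorted folder names give a stable class-name to index mapping.
        self.classes = sorted(p.name for p in root.iterdir() if p.is_dir())
        self.class_to_idx = {name: i for i, name in enumerate(self.classes)}
        # Index (path, label) pairs once; file contents are read later.
        self.samples = [
            (path, self.class_to_idx[name])
            for name in self.classes
            for path in sorted((root / name).iterdir())
            if path.is_file()
        ]
        # Assumption: loader maps a path to a sample; default reads bytes.
        self.loader = loader or (lambda p: p.read_bytes())

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        return self.loader(path), label


# Demo on a temporary directory with placeholder files, not real photos.
with tempfile.TemporaryDirectory() as tmp:
    for animal in ("cat", "dog"):
        folder = Path(tmp) / animal
        folder.mkdir()
        for i in range(2):
            (folder / f"{i}.jpg").write_bytes(b"fake image bytes")
    animals = FolderDataset(tmp)
    n_items = len(animals)
    labels = [animals[i][1] for i in range(n_items)]
    print(animals.classes, n_items, labels)
```

The label of each photo comes from its folder name, so adding a new category only means adding a new folder. For image data, torchvision's ImageFolder follows a similar folder-per-class convention and may be a better fit than a hand-written class.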

Key Takeaways

Manual data loading is slow and error-prone.

A custom Dataset class puts loading and labeling logic in one reusable place.

It simplifies working with complex or large datasets.