Why do we often create custom data pipelines when working with real-world data in PyTorch?
Think about how messy real data can be and what that means for loading it.
Real data usually needs cleaning, transforming, or special handling before it can be used for training. Custom pipelines let us do this.
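As a concrete illustration, here is a minimal sketch of a custom Dataset that builds cleaning and normalization into loading. The class name, records, and cleaning steps are illustrative, not part of any standard API:

```python
from torch.utils.data import Dataset

class CleaningDataset(Dataset):
    """Illustrative Dataset that cleans raw text records on access."""

    def __init__(self, raw_records):
        # Drop empty records up front (a simple "cleaning" step).
        self.records = [r for r in raw_records if r.strip()]

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        # Normalize each record as it is loaded.
        return self.records[idx].strip().lower()

dataset = CleaningDataset(["  Hello ", "", "WORLD"])
print(len(dataset))   # 2 -- the empty record was dropped
print(dataset[0])     # "hello"
```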
What will be the output of the following code snippet?
```python
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self):
        self.data = [10, 20, 30]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx] * 2

dataset = MyDataset()
print(dataset[1])
```
Look at what __getitem__ returns for index 1.
The dataset stores [10, 20, 30]; index 1 holds 20. __getitem__ returns the stored value times 2, so the output is 40.
You have a dataset of images with varying sizes and some corrupted files. Which data pipeline approach best handles this real data scenario?
Think about how to handle corrupted files and different image sizes automatically.
A custom Dataset can check each image for corruption and resize images dynamically, making the pipeline robust to real data issues.
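A hedged sketch of such a robust Dataset is below. To keep it self-contained, corrupt files are simulated as None entries; real code would instead wrap the file load (e.g. PIL's Image.open) in try/except. The fallback-to-next-sample strategy is one option among several (you could also filter corrupt paths up front):

```python
import torch
from torch.utils.data import Dataset

class RobustImageDataset(Dataset):
    """Sketch: tolerate corrupt entries and resize varying-size images."""

    def __init__(self, images, size=(4, 4)):
        self.images = images  # list of 2D tensors; None simulates a corrupt file
        self.size = size

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = self.images[idx]
        if img is None:
            # Corrupt file: fall back to the next valid sample.
            return self.__getitem__((idx + 1) % len(self.images))
        # Resize dynamically so every sample has the same shape.
        img = img[None, None]  # (1, 1, H, W) as required by interpolate
        img = torch.nn.functional.interpolate(
            img, size=self.size, mode="bilinear", align_corners=False
        )
        return img[0, 0]

images = [torch.rand(3, 5), None, torch.rand(6, 2)]
dataset = RobustImageDataset(images)
print(dataset[1].shape)  # torch.Size([4, 4]) -- corrupt entry replaced
```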
When using a custom data pipeline in PyTorch, how does increasing the batch size affect training?
Think about how batch size relates to memory and training speed.
Larger batch sizes consume more memory per step and often increase training throughput, but batches that are too large can hurt generalization.
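The memory/speed trade-off can be seen directly: a larger batch size means each batch tensor is bigger, but the loader yields fewer batches per epoch (the dataset below is a toy example for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 100 samples; larger batches mean fewer optimizer steps per epoch,
# but each step holds more data in memory at once.
dataset = TensorDataset(torch.arange(100).float())

for batch_size in (10, 50):
    loader = DataLoader(dataset, batch_size=batch_size)
    print(batch_size, len(loader))  # 10 -> 10 batches, 50 -> 2 batches
```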
Consider this custom Dataset and DataLoader code. The training hangs indefinitely. What is the most likely cause?
```python
from torch.utils.data import Dataset, DataLoader

class HangDataset(Dataset):
    def __init__(self):
        self.data = list(range(5))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        while True:
            pass  # Infinite loop

dataset = HangDataset()
loader = DataLoader(dataset, batch_size=2, num_workers=2)

for batch in loader:
    print(batch)
```
Look carefully at the __getitem__ method code.
The infinite loop inside __getitem__ means no sample is ever returned, so the DataLoader workers block forever and training hangs.
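A corrected version simply returns the sample. This sketch uses num_workers=0 (loading in the main process) to keep it lightweight; note that DataLoader also accepts a timeout argument, which, when worker processes are used, raises a RuntimeError instead of waiting forever if a worker stalls:

```python
from torch.utils.data import Dataset, DataLoader

class FixedDataset(Dataset):
    def __init__(self):
        self.data = list(range(5))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]  # return the sample instead of looping

dataset = FixedDataset()
loader = DataLoader(dataset, batch_size=2, num_workers=0)
for batch in loader:
    print(batch)  # tensor([0, 1]), tensor([2, 3]), tensor([4])
```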