Custom Dataset class in PyTorch - Model Metrics & Evaluation

When using a custom dataset class in PyTorch, the main goal is to ensure your data loads correctly and efficiently. A dataset class produces no model metrics of its own; the "metric" that matters is data integrity and loading speed. You want to confirm that your dataset class returns the right data samples and labels without errors or delays, so that model training uses the correct inputs and runs smoothly.
Since a custom dataset class is about data handling, not predictions, a confusion matrix does not apply here. Instead, you can check your dataset by printing sample outputs and their labels to verify correctness.
Sample output from dataset:

```
Index: 0 Image shape: (3, 224, 224) Label: 2
Index: 1 Image shape: (3, 224, 224) Label: 0
... (and so on)
```
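A sanity check like the one above can be reproduced with a minimal sketch: a synthetic dataset that generates random tensors in place of real images, so it runs without any files on disk. The class name, sample count, and number of classes here are assumptions for illustration.

```python
import torch
from torch.utils.data import Dataset

class SyntheticImageDataset(Dataset):
    """Stand-in for an image dataset; generates data instead of reading files."""

    def __init__(self, num_samples=3, num_classes=3):
        self.num_samples = num_samples
        self.num_classes = num_classes

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        image = torch.randn(3, 224, 224)   # stand-in for a decoded image
        label = idx % self.num_classes      # stand-in for a real label
        return image, label

dataset = SyntheticImageDataset()
for i in range(len(dataset)):
    image, label = dataset[i]
    print(f"Index: {i} Image shape: {tuple(image.shape)} Label: {label}")
```

Swapping the synthetic tensors for real file loading keeps the same check useful: iterate over a few indices and eyeball the shapes and labels before training.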
When designing a custom dataset, you often trade off between loading all data into memory (fast access but high memory use) or loading data on demand (low memory but slower). For example, loading all images at once speeds up training but needs more RAM. Loading images one by one saves memory but can slow training if disk access is slow.
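The two loading strategies can be sketched side by side. `load_image` below is a hypothetical placeholder for real disk I/O and decoding (e.g. PIL plus transforms), and the file paths are made up for illustration.

```python
import torch
from torch.utils.data import Dataset

def load_image(path):
    # Placeholder for disk I/O + decoding; returns a fixed-size tensor.
    return torch.zeros(3, 224, 224)

class EagerDataset(Dataset):
    """High memory, fast access: every image is loaded up front in __init__."""

    def __init__(self, paths, labels):
        self.data = [load_image(p) for p in paths]  # RAM cost paid here
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

class LazyDataset(Dataset):
    """Low memory, per-item I/O: images are loaded on demand in __getitem__."""

    def __init__(self, paths, labels):
        self.paths = paths
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return load_image(self.paths[idx]), self.labels[idx]

paths = ["img0.jpg", "img1.jpg"]  # hypothetical file paths
labels = [0, 1]
eager = EagerDataset(paths, labels)
lazy = LazyDataset(paths, labels)
```

Both classes return identical samples; the only difference is when the loading cost is paid.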
Good: Dataset returns correct data and labels, no crashes, consistent data shapes, and loads data quickly enough to keep training smooth.
Bad: Dataset returns wrong labels, crashes on some indices, produces inconsistent data shapes, or is so slow that training stalls.
Common pitfalls:
- Not implementing __len__ or __getitem__ correctly, causing errors.
- Returning data in the wrong format or shape, confusing the model.
- Mixing up labels and data order.
- Loading all data into memory unintentionally, causing crashes.
- Not handling file paths or corrupt files gracefully.
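Several of the pitfalls above can be addressed in one place. The sketch below keeps data and labels aligned, implements __len__ and __getitem__ correctly, and handles a corrupt file without crashing. The fallback policy (retry the next index) and the simulated decode failure are assumptions for illustration; logging bad files or filtering them out up front are equally valid designs.

```python
import torch
from torch.utils.data import Dataset

class RobustDataset(Dataset):
    def __init__(self, paths, labels):
        assert len(paths) == len(labels)  # keep data and labels aligned
        self.paths = paths
        self.labels = labels

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        try:
            image = self._load(self.paths[idx])
        except (OSError, ValueError):
            # Corrupt or missing file: fall back to the next sample.
            # (Assumes not every file is corrupt, or this would recurse forever.)
            return self[(idx + 1) % len(self)]
        return image, self.labels[idx]

    def _load(self, path):
        if "corrupt" in path:  # simulated decode failure for the example
            raise ValueError(f"cannot decode {path}")
        return torch.ones(3, 224, 224)  # placeholder for real decoding

ds = RobustDataset(["good0.jpg", "corrupt1.jpg", "good2.jpg"], [0, 1, 2])
image, label = ds[1]  # corrupt file, so the fallback sample is returned
```

Whatever policy you choose, the point is that a single bad file should not take down an entire training run.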
Scenario: Your custom dataset class loads images and labels. You notice training is very slow and sometimes crashes with out-of-memory errors. What might be wrong?
Answer: Your dataset might be loading all data into memory at once, using too much RAM. Consider loading data on demand in __getitem__ to reduce memory use and speed up training.
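If on-demand loading then makes each __getitem__ feel slow, a DataLoader can batch samples and, with num_workers greater than zero, prefetch them in background processes. The dataset below is a synthetic stand-in, and the worker count is an assumption to tune for your machine.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class LazyImages(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        # Real code would read and decode a file here, on demand.
        return torch.zeros(3, 224, 224), idx % 2

# num_workers=0 keeps the example portable; try 2-4 workers in real training
# so that disk reads overlap with GPU computation.
loader = DataLoader(LazyImages(), batch_size=4, shuffle=False, num_workers=0)
batches = list(loader)
```

This combination (lazy __getitem__ plus a multi-worker DataLoader) is the usual way to get both low memory use and fast loading.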