When working with custom datasets in PyTorch, what you measure is whether data loading is correct and efficient. Your dataset class must return the right samples and labels without errors. This is not a model metric like accuracy, but it is critical: bad data loading produces wrong training signals and poor model results. Efficiency matters too, so that data loading does not slow training down.
Dataset class (custom datasets) in PyTorch - Model Metrics & Evaluation
For dataset classes, we don't have a confusion matrix. Instead, we check data integrity by verifying the number of samples matches expectations and that each sample-label pair is correct. For example:
Dataset size: 1000 samples
Sample 0: image shape (3, 224, 224), label: 5
Sample 999: image shape (3, 224, 224), label: 2
This confirms that the dataset class loads the expected number of samples and returns consistently shaped data at both ends of the index range.
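The check above can be sketched in code. This is a minimal illustration, not a reference implementation: ToyImageDataset and check_dataset are hypothetical names, and the images are synthesized random tensors rather than files read from disk, so the labels it prints will differ from the ones shown above.

```python
import torch
from torch.utils.data import Dataset


class ToyImageDataset(Dataset):
    """Hypothetical map-style dataset that synthesizes images instead of reading files."""

    def __init__(self, num_samples=1000, num_classes=10, seed=0):
        self.num_samples = num_samples
        self.num_classes = num_classes
        self.seed = seed
        gen = torch.Generator().manual_seed(seed)
        # One integer label per sample, fixed at construction time.
        self.labels = torch.randint(0, num_classes, (num_samples,), generator=gen)

    def __len__(self):
        return self.num_samples

    def __getitem__(self, index):
        if not 0 <= index < self.num_samples:
            raise IndexError(f"index {index} out of range")
        # Seed by index so the same index always yields the same image.
        gen = torch.Generator().manual_seed(self.seed + index)
        image = torch.rand(3, 224, 224, generator=gen)
        return image, int(self.labels[index])


def check_dataset(dataset, expected_len, expected_shape):
    """Spot-check length, image shape, and label range on a few indices."""
    assert len(dataset) == expected_len, f"expected {expected_len}, got {len(dataset)}"
    for index in (0, len(dataset) // 2, len(dataset) - 1):
        image, label = dataset[index]
        assert tuple(image.shape) == expected_shape, f"bad shape at {index}"
        assert 0 <= label < dataset.num_classes, f"bad label at {index}"
        print(f"Sample {index}: image shape {tuple(image.shape)}, label: {label}")


if __name__ == "__main__":
    dataset = ToyImageDataset()
    print(f"Dataset size: {len(dataset)} samples")
    check_dataset(dataset, expected_len=1000, expected_shape=(3, 224, 224))
```

A spot check like this only covers the indices it visits. For a real dataset it is worth looping over every index at least once before a long training run.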
There is a tradeoff between loading data correctly and loading it fast. If you load all data into memory at once, loading is fast but uses lots of RAM. If you load data on the fly, it uses less memory but can slow training. The goal is to balance correctness (no errors, right labels) with efficiency (fast enough to keep training smooth).
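The two strategies can be contrasted with a small sketch. It is written in plain Python to stay dependency-free; in real code both classes would typically subclass torch.utils.data.Dataset. The names EagerDataset, LazyDataset, and load_sample are hypothetical, and load_sample stands in for an expensive read such as decoding an image file.

```python
def load_sample(index):
    """Hypothetical stand-in for an expensive read (e.g. decoding an image file)."""
    features = [float(index)] * 4
    label = index % 3
    return features, label


class EagerDataset:
    """Loads every sample up front: fast indexing, memory grows with dataset size."""

    def __init__(self, num_samples):
        self.samples = [load_sample(i) for i in range(num_samples)]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        return self.samples[index]


class LazyDataset:
    """Loads each sample on demand: low memory, pays the load cost on every access."""

    def __init__(self, num_samples):
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, index):
        if not 0 <= index < self.num_samples:
            raise IndexError(f"index {index} out of range")
        return load_sample(index)


if __name__ == "__main__":
    eager, lazy = EagerDataset(100), LazyDataset(100)
    # Both strategies must return identical data; only cost and memory differ.
    assert all(eager[i] == lazy[i] for i in range(100))
```

Whichever strategy you choose, correctness is the same requirement: both classes must return identical samples for the same index.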
Good: Dataset returns correct samples and labels, length matches dataset size, no crashes during training, data shapes are consistent, and loading speed keeps up with training.
Bad: Dataset returns wrong labels, crashes on some indexes, length is wrong, data shapes vary unexpectedly, or loading is too slow causing training delays.
- Mixing up labels and data order causing wrong training signals.
- Not implementing __len__ or __getitem__ correctly.
- Loading all data into memory causing crashes on large datasets.
- Slow data loading blocking GPU training.
- Data leakage by accidentally including test data in training dataset.
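The data leakage pitfall is the easiest of these to test for mechanically. The sketch below assumes every sample has a unique identifier, such as a file name; find_leaked_ids and the file names are hypothetical and only illustrate the idea.

```python
def find_leaked_ids(train_ids, test_ids):
    """Return identifiers that appear in both splits, sorted for stable output."""
    return sorted(set(train_ids) & set(test_ids))


train_files = ["img_000.png", "img_001.png", "img_002.png"]
test_files = ["img_002.png", "img_003.png"]

leaked = find_leaked_ids(train_files, test_files)
print(leaked)  # ['img_002.png']
```

An empty result means the two splits share no identifiers. It does not rule out duplicates of the same content stored under different names.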
Your custom dataset class returns 1000 samples but during training, the model gets random results and loss does not improve. What could be wrong?
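Answer: A correct length says nothing about whether each sample is paired with its own label. The dataset might be returning wrong labels or mismatched data-label pairs, for example because samples and labels are ordered differently. Check your __getitem__ method to ensure each index returns a sample together with the label that belongs to it.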
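A mismatch like this can be detected by comparing each returned label against an independent source of truth. The sketch below is illustrative only: MisalignedDataset is a deliberately buggy, hypothetical class, and true_label_of is an assumed function that recovers the correct label from the sample itself (in practice this might come from a file name or a metadata table).

```python
class MisalignedDataset:
    """Deliberately buggy: samples are reordered but labels keep their original order."""

    def __init__(self, samples, labels):
        self.samples = list(reversed(samples))  # bug: labels are not reversed too
        self.labels = list(labels)

    def __len__(self):
        return len(self.samples)  # the length is still correct despite the bug

    def __getitem__(self, index):
        return self.samples[index], self.labels[index]


def count_mismatches(dataset, true_label_of):
    """Count indices whose returned label disagrees with the independent ground truth."""
    mismatches = 0
    for index in range(len(dataset)):
        sample, label = dataset[index]
        if true_label_of(sample) != label:
            mismatches += 1
    return mismatches


samples = list(range(100))
labels = [s % 10 for s in samples]  # toy rule: the label is the last digit

dataset = MisalignedDataset(samples, labels)
print(len(dataset))                                     # 100
print(count_mismatches(dataset, lambda s: s % 10))      # 100
```

Here the length check passes while every pair is wrong, which is the situation described in the question.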