Image dataset from folders in PyTorch - Model Metrics & Evaluation

When using image datasets loaded from folders, the key metrics depend on the task. For classification, accuracy shows what fraction of images are labeled correctly. When classes are imbalanced, however, precision and recall become important for understanding how well the model handles each class. These metrics also help verify that loading and labeling from folders worked correctly and that the model is actually learning.
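To make the folder-to-label step concrete, here is a minimal pure-Python sketch of the labeling rule that `torchvision.datasets.ImageFolder` applies: each subfolder of the root directory becomes one class, and classes are mapped to integer indices in alphabetical order. The directory names (`cat`, `dog`) and file names are illustrative only.

```python
import os
import tempfile

# Mimic torchvision.datasets.ImageFolder's labeling rule:
# each subfolder of the root is one class; class names are sorted
# alphabetically and mapped to integer indices.
def folder_labels(root):
    classes = sorted(
        d for d in os.listdir(root)
        if os.path.isdir(os.path.join(root, d))
    )
    class_to_idx = {c: i for i, c in enumerate(classes)}
    samples = []
    for c in classes:
        for fname in sorted(os.listdir(os.path.join(root, c))):
            samples.append((os.path.join(root, c, fname), class_to_idx[c]))
    return class_to_idx, samples

# Demo with a throwaway directory tree: root/cat/*.jpg, root/dog/*.jpg
root = tempfile.mkdtemp()
for cls, n in [("cat", 2), ("dog", 3)]:
    os.makedirs(os.path.join(root, cls))
    for i in range(n):
        open(os.path.join(root, cls, f"{i}.jpg"), "w").close()

class_to_idx, samples = folder_labels(root)
print(class_to_idx)   # {'cat': 0, 'dog': 1}
print(len(samples))   # 5
```

If a folder is misnamed or images land in the wrong subfolder, every sample in it gets the wrong label, which is exactly the kind of problem the metrics below help detect.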
A confusion matrix for a two-class cat/dog test set of 100 images:

|            | Predicted Cat | Predicted Dog |
|------------|---------------|---------------|
| Actual Cat | 45            | 5             |
| Actual Dog | 3             | 47            |

For the cat class: TP = 45, FP = 3, FN = 5, TN = 47 (total samples = 100). This matrix shows how many images were classified correctly or incorrectly after loading from folders.
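Using the cat-class counts from the matrix above, accuracy, precision, and recall can be computed directly (a minimal sketch; the numbers are exactly those in the table):

```python
# Confusion-matrix counts for the "cat" class (from the table above).
tp, fp, fn, tn = 45, 3, 5, 47

accuracy = (tp + tn) / (tp + fp + fn + tn)   # correct predictions / total
precision = tp / (tp + fp)                   # of predicted cats, how many are cats
recall = tp / (tp + fn)                      # of actual cats, how many were found

print(f"accuracy={accuracy:.2f} precision={precision:.3f} recall={recall:.2f}")
# accuracy=0.92 precision=0.938 recall=0.90
```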
Precision tells us how many of the images predicted as a class actually belong to that class. Recall tells us how many images of a class the model managed to find.
Example: If you have a folder with many dog images but few cat images, high recall for cats means the model finds most of the cat images; high precision means that when the model says "cat," it is usually correct.
Depending on the use case, you may want to optimize for one over the other. In a pet-labeling app, for example, high precision avoids wrong labels; in wildlife monitoring, high recall matters so that rare animals are not missed.
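The per-class definitions above can be sketched as a small helper that takes parallel lists of actual and predicted labels (the function name and the imbalanced toy data are hypothetical):

```python
from collections import Counter

def per_class_metrics(actual, predicted):
    """Per-class precision and recall from parallel label lists."""
    classes = set(actual) | set(predicted)
    tp = Counter()                      # correct predictions per class
    pred_count = Counter(predicted)     # how often each class was predicted
    actual_count = Counter(actual)      # how often each class truly occurs
    for a, p in zip(actual, predicted):
        if a == p:
            tp[a] += 1
    return {
        c: {
            "precision": tp[c] / pred_count[c] if pred_count[c] else 0.0,
            "recall": tp[c] / actual_count[c] if actual_count[c] else 0.0,
        }
        for c in classes
    }

# Imbalanced toy run: many dogs, few cats.
actual    = ["dog"] * 8 + ["cat"] * 2
predicted = ["dog"] * 7 + ["cat"] + ["cat", "dog"]
m = per_class_metrics(actual, predicted)
print(m["cat"])   # {'precision': 0.5, 'recall': 0.5}
print(m["dog"])   # {'precision': 0.875, 'recall': 0.875}
```

Note how the minority class (cat) scores much worse than the majority class even though only two of ten predictions are wrong.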
Good: Accuracy above 90%, with precision and recall above 85% for all classes, suggests the dataset was loaded from folders correctly and the model is learning well.
Bad: Accuracy below 70%, or very low precision/recall for some classes, points to problems such as mislabeled folders, imbalanced data, or model issues.
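These rules of thumb can be turned into a quick automated check. The function below is a hypothetical sketch that flags any metric falling below the thresholds just described (0.70 overall accuracy, 0.85 per-class precision/recall):

```python
# Hypothetical health check using the thresholds above: flag a run as
# problematic if accuracy < 0.70 or any class has precision or recall
# below 0.85.
def dataset_health(accuracy, per_class):
    issues = []
    if accuracy < 0.70:
        issues.append(f"overall accuracy {accuracy:.2f} below 0.70")
    for cls, m in per_class.items():
        for metric in ("precision", "recall"):
            if m[metric] < 0.85:
                issues.append(f"{cls}: {metric} {m[metric]:.2f} below 0.85")
    return issues   # an empty list means the loaded dataset looks healthy

print(dataset_health(0.92, {"cat": {"precision": 0.94, "recall": 0.90}}))
# []
print(dataset_health(0.65, {"cat": {"precision": 0.40, "recall": 0.30}}))
```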
- Accuracy paradox: High accuracy can be misleading if one class dominates the dataset loaded from folders.
- Data leakage: If the same images appear in both the training and test sets, metrics will be too optimistic.
- Overfitting indicators: Very high training accuracy but low test accuracy means the model is memorizing the training images instead of generalizing.
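The data-leakage pitfall is easy to guard against when samples come from folders: the same file path must never appear in both splits. A minimal sketch (the paths shown are illustrative):

```python
# Simple leakage check: the same image file must never appear in both
# the training and test splits.
def leaked_files(train_paths, test_paths):
    return sorted(set(train_paths) & set(test_paths))

train = ["data/cat/001.jpg", "data/cat/002.jpg", "data/dog/010.jpg"]
test  = ["data/cat/002.jpg", "data/dog/011.jpg"]

overlap = leaked_files(train, test)
print(overlap)   # ['data/cat/002.jpg'] -> metrics would be too optimistic
```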
Your model trained on images loaded from folders has 98% accuracy but only 12% recall on a rare class. Is it good for production? Why or why not?
Answer: No, it is not good. The model misses most images of the rare class (low recall), which means it fails to detect important cases even though overall accuracy is high.
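To see how both numbers can hold at once, here is one hypothetical set of counts chosen to reproduce the scenario (5000 test images with a 50-image rare class; all counts are assumptions for illustration):

```python
# Hypothetical counts chosen to reproduce the scenario: 5000 test images,
# a rare class with 50 examples, only 6 of which the model finds.
total = 5000
rare_total = 50
rare_found = 6                   # true positives on the rare class
other_errors = 56                # mistakes outside the rare class

missed_rare = rare_total - rare_found          # 44 false negatives
correct = total - missed_rare - other_errors   # 4900 correct predictions

accuracy = correct / total
rare_recall = rare_found / rare_total
print(f"accuracy={accuracy:.0%} rare-class recall={rare_recall:.0%}")
# accuracy=98% rare-class recall=12%
```

Because the rare class makes up only 1% of the test set, even missing 88% of it barely dents overall accuracy, which is the accuracy paradox in action.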