
Built-in datasets (torchvision.datasets) in PyTorch - Model Metrics & Evaluation

Which metric matters for built-in datasets (torchvision.datasets), and why

When using built-in datasets like those in torchvision.datasets, the key metrics depend on the task you train your model for. For example, if you train a classifier on CIFAR-10, accuracy is a simple and clear metric to see how well your model predicts the correct class.

However, if the dataset is imbalanced (some classes appear more than others), accuracy alone can be misleading. In that case, metrics like precision, recall, and F1 score become important to understand how well the model performs on each class.

In summary, the dataset provides the data, but the metric you choose depends on your model's goal and the dataset's balance.
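
As a concrete illustration, here is a minimal sketch in plain Python (with made-up toy labels, not a real CIFAR-10 run) that computes accuracy alongside per-class precision, recall, and F1. It shows how the per-class numbers expose an imbalance problem that overall accuracy hides:

```python
def evaluate(y_true, y_pred, classes):
    """Compute overall accuracy plus per-class (precision, recall, F1)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    per_class = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class[c] = (precision, recall, f1)
    return accuracy, per_class

# Hypothetical imbalanced labels: class 0 dominates, class 1 is rare.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
acc, per_class = evaluate(y_true, y_pred, classes=[0, 1])
print(acc)           # 0.9 overall, which looks fine...
print(per_class[1])  # ...but recall on the rare class is only 0.5
```

In practice you would use a library such as scikit-learn or torchmetrics for this, but the hand-rolled version makes the definitions explicit.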

Confusion matrix example for a 3-class classification
           Predicted
           C1   C2   C3
    T1     50    2    3
    T2      4   45    1
    T3      5    2   48


Here, T1, T2, T3 are true classes, and C1, C2, C3 are predicted classes.

This matrix helps calculate precision and recall for each class, showing where the model confuses classes.
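
The per-class numbers can be read straight off the matrix: for each class, precision divides the diagonal entry by its column sum (all predictions of that class), while recall divides it by its row sum (all true examples of that class). A small sketch using the matrix above:

```python
# The confusion matrix above: rows are true classes, columns are predicted.
cm = [
    [50, 2, 3],   # T1
    [4, 45, 1],   # T2
    [5, 2, 48],   # T3
]

n = len(cm)
# Precision for class c: diagonal / column sum; recall: diagonal / row sum.
precision = [cm[c][c] / sum(cm[r][c] for r in range(n)) for c in range(n)]
recall = [cm[c][c] / sum(cm[c]) for c in range(n)]

for c in range(n):
    print(f"class {c + 1}: precision={precision[c]:.3f} recall={recall[c]:.3f}")
```

For class 1, for example, precision is 50/59 (C1 also absorbs 4 + 5 wrong predictions) while recall is 50/55.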

Precision vs Recall tradeoff with built-in datasets

Imagine you use the MNIST dataset to train a digit recognizer. If your model is very cautious and only predicts a digit when very sure, it may have high precision (few wrong guesses) but low recall (misses many digits).

On the other hand, if it guesses more often, recall improves (finds more digits), but precision drops (more wrong guesses).

Choosing the right balance depends on your goal. For example, in medical image datasets, missing a disease (low recall) is worse than a false alarm (low precision).
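
The tradeoff is easy to see by sweeping a decision threshold over raw model scores. The binary "is this a 7?" scores and labels below are made up for illustration:

```python
# Hypothetical model confidence scores and true binary labels.
scores = [0.95, 0.90, 0.80, 0.60, 0.55, 0.40, 0.30, 0.20]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

def precision_recall(threshold):
    """Precision and recall when predicting positive above a score threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, labels) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn)
    return precision, recall

# Cautious threshold: high precision, low recall.
print(precision_recall(0.85))   # (1.0, 0.5)
# Permissive threshold: recall rises, precision falls.
print(precision_recall(0.35))
```

Raising the threshold makes the model "cautious" (few false positives, more misses); lowering it does the opposite, exactly the tradeoff described above.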

Good vs Bad metric values for models trained on built-in datasets

Good metrics:

  • High accuracy (e.g., >90% on CIFAR-10) means the model predicts most images correctly.
  • Balanced precision and recall across classes show the model is fair and reliable.
  • F1 score close to accuracy indicates no big tradeoff issues.

Bad metrics:

  • High accuracy but very low recall on some classes means the model ignores those classes.
  • Very low precision means many of the model's positive predictions are wrong, flooding downstream use with false positives.
  • Large difference between training and validation accuracy suggests overfitting.

Common pitfalls when evaluating models on built-in datasets

  • Accuracy paradox: High accuracy can be misleading if classes are imbalanced.
  • Data leakage: Accidentally using test data during training inflates metrics falsely.
  • Overfitting: Model performs well on training data but poorly on new data.
  • Ignoring class imbalance: Not checking per-class metrics hides poor performance on rare classes.
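
The accuracy paradox from the first pitfall can be reproduced in a few lines. On a hypothetical 98/2 class split, a model that always predicts the majority class still scores 98% accuracy while never finding the rare class:

```python
# Hypothetical imbalanced test set: 98 "normal" examples vs 2 "rare" ones.
y_true = [0] * 98 + [1] * 2
y_pred = [0] * 100          # a model that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
rare_recall = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1) / 2

print(accuracy)     # 0.98, which looks great...
print(rare_recall)  # ...but recall on the rare class is 0.0
```
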

Self-check question

Your model trained on a built-in dataset has 98% accuracy but only 12% recall on a rare class. Is it good for production? Why or why not?

Answer: No, it is not good. The low recall means the model misses most examples of the rare class, which could be critical depending on the task. High accuracy is misleading here because the rare class is small compared to others. You need to improve recall to catch more of the rare class.

Key Result
Accuracy is a simple starting point, but precision, recall, and F1 score reveal true model performance on built-in datasets.