
Training an image classifier in Computer Vision - Model Metrics & Evaluation

Which metrics matter when training an image classifier, and why

When training an image classifier, the main goal is to assign each image to its correct category. The key metrics to watch are accuracy, precision, recall, and F1 score.

Accuracy tells us the overall percentage of images the model got right. But accuracy alone can be misleading when some classes appear much more often than others.

Precision measures how many images the model labeled as a certain class actually belong to that class. This is important when false alarms are costly.

Recall shows how many images of a class the model successfully found out of all images that truly belong to that class. This matters when missing a class is bad.

F1 score balances precision and recall, giving a single number to understand the model's quality on each class.

For image classifiers, especially multi-class ones, it is worth looking at these metrics both per class and averaged overall.
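The per-class computation can be sketched in plain Python; the labels and predictions below are invented purely for illustration:

```python
from collections import Counter

# Hypothetical predictions from a 3-class classifier (illustrative data only)
y_true = ["cat", "cat", "dog", "dog", "rabbit", "rabbit", "cat", "dog"]
y_pred = ["cat", "dog", "dog", "dog", "rabbit", "cat", "cat", "dog"]

# Count (actual, predicted) pairs once, then derive TP/FP/FN per class
pairs = Counter(zip(y_true, y_pred))
classes = sorted(set(y_true))

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"overall accuracy: {accuracy:.2f}")

for c in classes:
    tp = pairs[(c, c)]
    fp = sum(v for (t, p), v in pairs.items() if p == c and t != c)
    fn = sum(v for (t, p), v in pairs.items() if t == c and p != c)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    print(f"{c}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Libraries such as scikit-learn compute the same numbers (e.g., its classification report), but the arithmetic is just this.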

Confusion matrix example

Imagine a simple 3-class image classifier for cats, dogs, and rabbits. Here is a confusion matrix showing predictions vs actual labels:

              Predicted
              Cat   Dog   Rabbit
    Actual
    Cat        50     2      3
    Dog         4    45      1
    Rabbit      2     3     40

Explanation:

  • True Positives (TP) for Cat = 50 (correctly predicted cats)
  • False Positives (FP) for Cat = 4 + 2 = 6 (dogs and rabbits wrongly predicted as cats)
  • False Negatives (FN) for Cat = 2 + 3 = 5 (cats wrongly predicted as dogs or rabbits)
  • True Negatives (TN) for Cat = total samples - TP - FP - FN = 150 - 50 - 6 - 5 = 89

We can calculate precision, recall, and F1 for each class from this matrix.
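That calculation looks like this with NumPy (a sketch; the class order follows the table above):

```python
import numpy as np

# Confusion matrix from the example: rows = actual, columns = predicted
# Class order: Cat, Dog, Rabbit
cm = np.array([
    [50,  2,  3],   # actual Cat
    [ 4, 45,  1],   # actual Dog
    [ 2,  3, 40],   # actual Rabbit
])

classes = ["Cat", "Dog", "Rabbit"]
for i, name in enumerate(classes):
    tp = cm[i, i]
    fp = cm[:, i].sum() - tp   # predicted as class i but actually another class
    fn = cm[i, :].sum() - tp   # actually class i but predicted as another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{name}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For Cat this gives precision 50/56 ≈ 0.893 and recall 50/55 ≈ 0.909, matching the TP/FP/FN counts listed above.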

Precision vs Recall tradeoff with examples

In image classification, sometimes you want to be very sure when the model says an image is a certain class (high precision). For example, if the model detects rare animals, you want few false alarms.

Other times, you want to catch as many images of a class as possible (high recall). For example, if the model finds defective products, missing any defect is costly.

Improving precision often lowers recall and vice versa. The F1 score helps balance this tradeoff.
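One common way to trade precision against recall is to move the decision threshold on the model's predicted probability. A minimal sketch with made-up scores (1 = positive class):

```python
# Illustrative predicted probabilities and true labels; not from a real model
probs  = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1,    1,    1,    0,    1,    1,    0,    0]

def precision_recall(threshold):
    """Precision and recall when predicting positive for probs >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, l in zip(preds, labels) if p == 1 and l == 1)
    fp = sum(1 for p, l in zip(preds, labels) if p == 1 and l == 0)
    fn = sum(1 for p, l in zip(preds, labels) if p == 0 and l == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn)
    return precision, recall

for t in (0.25, 0.50, 0.75):
    p, r = precision_recall(t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, raising the threshold from 0.25 to 0.75 pushes precision up while recall drops: the model becomes more cautious and misses more true positives.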

What good vs bad metric values look like

Good metrics for an image classifier might be:

  • Accuracy above 85% on a balanced dataset
  • Precision and recall above 80% for each important class
  • F1 scores close to precision and recall, showing balance

Bad metrics might be:

  • High accuracy but very low recall on some classes (model misses many images)
  • High precision but very low recall (model is too cautious and misses many true images)
  • Very low overall accuracy (model guesses poorly)

Common pitfalls in metrics for image classifiers

  • Accuracy paradox: High accuracy can hide poor performance if classes are imbalanced (e.g., 95% accuracy by always guessing the largest class).
  • Data leakage: If test images are too similar to training images, metrics look better than real performance.
  • Overfitting: Very high training accuracy but low test accuracy means the model memorizes training images but fails on new ones.
  • Ignoring per-class metrics: Overall accuracy can hide poor results on rare classes.
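The accuracy paradox from the first bullet is easy to reproduce with a degenerate model that always predicts the majority class (the counts below are hypothetical):

```python
# 95 "common" images, 5 "rare" images; the model always predicts "common"
y_true = ["common"] * 95 + ["rare"] * 5
y_pred = ["common"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
rare_recall = sum(
    1 for t, p in zip(y_true, y_pred) if t == "rare" and p == "rare"
) / 5

print(f"accuracy={accuracy:.2f}")           # looks great
print(f"rare-class recall={rare_recall:.2f}")  # the rare class is never found
```

Accuracy comes out at 0.95 while recall on the rare class is 0.00, which is exactly why per-class metrics matter.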

Self-check question

Your image classifier has 98% accuracy but only 12% recall on a rare but important class. Is this model good for production? Why or why not?

Answer: No, it is not good. The low recall means the model misses most images of that important class, which can be critical depending on the use case. High accuracy is misleading here because the rare class is small compared to others.

Key Result
Accuracy alone can be misleading; precision, recall, and F1 per class give a clearer picture of image classifier quality.