DNN-based face detection in Computer Vision - Model Metrics & Evaluation

For face detection using deep neural networks, the key metrics are Precision and Recall. Precision tells us how many detected faces are actually faces, so it measures false alarms. Recall tells us how many real faces the model finds, so it measures missed faces. Both matter: we want to find as many faces as possible (high recall) without wrongly marking non-faces as faces (high precision). The F1 score balances these two, and the Average Precision (AP) over different confidence thresholds is often used to summarize performance in a single number.
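AP can be computed by ranking detections by confidence and accumulating the area under the precision-recall curve. A minimal sketch (the scores, match flags, and face count below are invented for illustration):

```python
# Average Precision: area under the precision-recall curve,
# built by sweeping down the confidence-ranked detection list.

def average_precision(scores, is_true_positive, num_gt_faces):
    """scores: detection confidences; is_true_positive: whether each
    detection matches a real face; num_gt_faces: total real faces."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_gt_faces
        ap += precision * (recall - prev_recall)  # rectangle rule
        prev_recall = recall
    return ap

# Hypothetical ranked detections: 4 correct, 2 false alarms, 5 real faces
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50]
matches = [True, True, False, True, False, True]
print(round(average_precision(scores, matches, num_gt_faces=5), 3))  # 0.683
```

Benchmark toolkits (e.g. COCO-style evaluation) use interpolated variants of this curve, but the idea is the same.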
|                | Predicted Face | Predicted No Face |
|----------------|----------------|-------------------|
| Actual Face    | 90 (TP)        | 10 (FN)           |
| Actual No Face | 15 (FP)        | 885 (TN)          |
Total samples = 1000
Precision = TP / (TP + FP) = 90 / (90 + 15) = 0.857
Recall = TP / (TP + FN) = 90 / (90 + 10) = 0.9
F1 Score = 2 * (0.857 * 0.9) / (0.857 + 0.9) ≈ 0.878
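The numbers above can be reproduced directly from the confusion-matrix counts:

```python
# Metrics from the confusion matrix above
tp, fn, fp, tn = 90, 10, 15, 885

precision = tp / (tp + fp)          # 90 / 105
recall    = tp / (tp + fn)          # 90 / 100
f1        = 2 * precision * recall / (precision + recall)

print(f"Precision = {precision:.3f}")  # 0.857
print(f"Recall    = {recall:.3f}")     # 0.900
print(f"F1        = {f1:.3f}")         # 0.878
```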
If the model is tuned to be very strict, it will only detect faces when very sure. This means high precision (few false alarms) but low recall (misses many faces). This is good if false alarms are costly, like in security checks.
If the model is tuned to detect as many faces as possible, it will catch almost all faces (high recall) but also mark some non-faces as faces (low precision). This is useful in photo apps where missing a face is worse than a few false detections.
Choosing the right balance depends on the application needs.
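The strict-vs-lenient trade-off can be seen by sweeping a confidence threshold over a set of scored detections. A toy sketch (the scores, match flags, and face count are made up for illustration):

```python
# Each detection: (confidence score, does it actually cover a real face?)
detections = [(0.98, True), (0.95, True), (0.85, True), (0.70, False),
              (0.60, True), (0.40, False), (0.30, True)]
num_faces = 5  # total real faces in the test images

def precision_recall(threshold):
    """Keep only detections at or above the threshold, then score them."""
    kept = [hit for score, hit in detections if score >= threshold]
    if not kept:
        return 0.0, 0.0
    tp = sum(kept)
    return tp / len(kept), tp / num_faces

# Strict threshold: no false alarms, but most faces are missed
print(precision_recall(0.9))   # (1.0, 0.4)
# Lenient threshold: every face found, at the cost of false alarms
print(precision_recall(0.3))   # (~0.714, 1.0)
```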
- Good: Precision and Recall both above 0.85, F1 score near 0.9 or higher. This means most faces are found and few false alarms.
- Bad: Precision or Recall below 0.5 means many false alarms or many missed faces. For example, Precision 0.4 means more than half of the detected faces are wrong.
- Very high accuracy alone can be misleading if the dataset has many non-face images (class imbalance).
- Accuracy paradox: If most images have no faces, a model that always predicts no face can have high accuracy but is useless.
- Data leakage: Testing on images very similar to training images inflates metrics falsely.
- Overfitting: Very high training metrics but low test metrics means the model memorizes training faces but fails on new ones.
- Ignoring IoU thresholds: Face detection usually requires bounding boxes to overlap enough with ground truth. Metrics must consider this overlap.
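The IoU point is worth making concrete: a detection only counts as a true positive if its bounding box overlaps the ground-truth box by at least some threshold (0.5 is a common choice). A minimal sketch with made-up boxes:

```python
def iou(box_a, box_b):
    """Intersection-over-Union for boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

ground_truth = (10, 10, 50, 50)
detection    = (20, 20, 60, 60)
print(round(iou(ground_truth, detection), 3))  # 0.391
# At the usual IoU >= 0.5 threshold, this detection would NOT count as a TP,
# even though it clearly overlaps the face.
```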
Your face detection model has 98% accuracy but only 12% recall on faces. Is it good for production? Why or why not?
Answer: No, it is not good. The high accuracy is misleading because most images may not have faces. The very low recall means the model misses 88% of faces, which defeats the purpose of face detection.
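The accuracy paradox behind this answer can be checked numerically. A toy sketch, assuming an imbalanced test set of 980 no-face images and 20 face images (these counts are invented for illustration):

```python
# Accuracy paradox: a trivial "always no face" predictor on an imbalanced set.
num_no_face, num_face = 980, 20

# The trivial model never predicts a face:
tp, fn = 0, num_face          # misses every real face
tn, fp = num_no_face, 0       # never raises a false alarm

accuracy = (tp + tn) / (num_no_face + num_face)
recall = tp / (tp + fn)
print(f"Accuracy = {accuracy:.0%}, Recall = {recall:.0%}")
# Accuracy = 98%, Recall = 0% -- high accuracy, yet useless for detection.
```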