Annotation quality in Computer Vision - Model Metrics & Evaluation

Annotation quality refers to how correct and consistent the labels or marks on images are. Good annotations help the model learn well. The key metrics for checking annotation quality are inter-annotator agreement and consistency scores, which show whether different people label the same images similarly. For object detection, Intersection over Union (IoU) measures how well bounding boxes match. High agreement and high IoU mean better annotation quality, which leads to better model training.
For annotation quality, a confusion matrix can compare two annotators' labels. Example for 3 classes (Cat, Dog, Bird), with Annotator A's labels as rows and Annotator B's as columns:

| A \ B | Cat | Dog | Bird |
|-------|-----|-----|------|
| Cat   | 45  | 3   | 2    |
| Dog   | 4   | 40  | 6    |
| Bird  | 1   | 5   | 44   |
This shows how often annotators agree (diagonal) or disagree (off-diagonal). High diagonal numbers mean good annotation quality.
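A common way to turn such a confusion matrix into a single agreement score is Cohen's kappa, which corrects the raw diagonal agreement for agreement expected by chance. A minimal stdlib-only sketch, applied to the matrix above:

```python
def cohens_kappa(matrix):
    """Cohen's kappa for a square annotator-vs-annotator confusion matrix."""
    total = sum(sum(row) for row in matrix)
    # Observed agreement: fraction of items on the diagonal.
    p_o = sum(matrix[i][i] for i in range(len(matrix))) / total
    # Chance agreement, from the row and column marginals.
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    p_e = sum(r * c for r, c in zip(row_sums, col_sums)) / total**2
    return (p_o - p_e) / (1 - p_e)

cm = [[45, 3, 2],
      [4, 40, 6],
      [1, 5, 44]]
print(round(cohens_kappa(cm), 2))  # 0.79
```

Here the raw agreement is 129/150 = 0.86, and chance-corrected kappa is 0.79, which would count as good agreement by the thresholds discussed below.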
In annotation, precision is the fraction of labeled objects that are correct; recall is the fraction of true objects that were labeled.
- High precision, low recall: annotators label only very clear objects and miss others. The model learns from fewer examples, but those examples are reliable.
- High recall, low precision: annotators label many objects, including uncertain ones, so some labels are wrong. The model learns from more examples, but with noise.
Good annotation balances precision and recall to provide enough correct examples without too many mistakes.
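The two annotator styles above can be made concrete by comparing an annotator's labels against a gold (expert-reviewed) set. The counts below are hypothetical, chosen only to illustrate the trade-off:

```python
def precision_recall(true_positive, false_positive, false_negative):
    """Precision/recall of an annotator's labels against a gold label set."""
    precision = true_positive / (true_positive + false_positive)  # labeled objects that are correct
    recall = true_positive / (true_positive + false_negative)     # true objects that were labeled
    return precision, recall

# Cautious annotator: labels only clear objects (few false positives, many misses).
print(precision_recall(80, 5, 40))    # high precision, low recall
# Liberal annotator: labels everything (few misses, more wrong labels).
print(precision_recall(110, 35, 10))  # high recall, lower precision
```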
- Good: Inter-annotator agreement > 0.8 (80%), IoU > 0.75, consistent labels across annotators.
- Bad: Agreement < 0.6 (60%), IoU < 0.5, many conflicting labels or missing annotations.
Good values mean the model will learn from reliable data. Bad values mean the model may learn wrong patterns.
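The IoU thresholds above can be checked directly on pairs of bounding boxes from different annotators. A minimal sketch, assuming boxes in (x1, y1, x2, y2) corner format:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes don't overlap).
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Two annotators' boxes for the same object.
print(round(iou((10, 10, 50, 50), (12, 12, 52, 48)), 2))  # 0.82, above the 0.75 "good" bar
```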
- Ignoring annotator bias: Some annotators may be stricter or more lenient, skewing agreement.
- Data leakage: Using test images in annotation checks can falsely inflate agreement.
- Overfitting to noisy labels: Model may memorize wrong annotations if quality is poor.
- Accuracy paradox: High overall accuracy but poor class-wise agreement hides annotation issues.
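The accuracy paradox in the last bullet is easy to demonstrate with a per-class breakdown. The counts below are a hypothetical two-annotator dataset where a rare "Bird" class is labeled inconsistently:

```python
# Pair counts (annotator A's label, annotator B's label) -> number of items.
cm = {("Cat", "Cat"): 90, ("Dog", "Dog"): 90,
      ("Bird", "Bird"): 2, ("Bird", "Dog"): 9, ("Dog", "Bird"): 9}
classes = ["Cat", "Dog", "Bird"]

total = sum(cm.values())
overall = sum(cm.get((c, c), 0) for c in classes) / total
print(f"overall agreement: {overall:.2f}")  # looks healthy

for c in classes:
    # Per-class agreement: agreed items over all items either annotator put in the class.
    involved = sum(v for (a, b), v in cm.items() if c in (a, b))
    print(f"{c}: {cm.get((c, c), 0) / involved:.2f}")
```

Overall agreement comes out at 0.91, yet the Bird class agrees on only 2 of 20 items (0.10): a per-class view exposes what the headline number hides.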
Your annotation team has 98% agreement on easy images but only 50% on hard images. Is your annotation quality good enough? Why or why not?
Answer: No. Low agreement on hard images means the labels are inconsistent exactly where the model most needs reliable supervision, which can hurt performance on challenging cases.