In multi-label classification, each example can carry several correct labels at once, so we need metrics that evaluate the full predicted label set rather than a single class decision.
Common metrics are:
- Hamming Loss: The fraction of individual label predictions that are wrong, averaged over all example–label pairs. Lower is better.
- Subset Accuracy: The fraction of examples whose predicted label set matches the true label set exactly; a single wrong or missing label counts the whole example as incorrect. Very strict.
- Precision, Recall, and F1-score (micro and macro averaged): Precision measures how many predicted labels are correct, recall measures how many true labels are found, and F1 balances the two. Micro averaging pools all label decisions before computing the score, while macro averaging computes the score per label and then averages, giving rare labels equal weight.
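The metrics above can be computed with scikit-learn (assumed available here); the label matrices and values below are a made-up toy example, not from the original text:

```python
# Minimal sketch of multi-label metrics, assuming scikit-learn is installed.
import numpy as np
from sklearn.metrics import hamming_loss, accuracy_score, f1_score

# Binary indicator matrices: rows = examples, columns = labels.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])

# Hamming loss: fraction of wrong label predictions (2 wrong cells of 9).
print(hamming_loss(y_true, y_pred))

# Subset accuracy: for multi-label input, accuracy_score requires an
# exact match of the whole row (only the middle example matches here).
print(accuracy_score(y_true, y_pred))

# Micro-F1 pools all label decisions; macro-F1 averages per-label F1.
print(f1_score(y_true, y_pred, average="micro"))
print(f1_score(y_true, y_pred, average="macro", zero_division=0))
```

Note that `accuracy_score` silently switches meaning for multi-label input: it becomes subset accuracy, which is why it looks so much harsher than the Hamming-based view of the same predictions.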
We use these because simple accuracy is uninformative when each label can be true or false independently: a model can be right on most individual labels while rarely getting an entire label set correct, and when labels are sparse, a trivial predictor that outputs nothing can still look highly accurate per label.
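The sparse-label failure mode can be demonstrated concretely; the simulated data below (about 5% positive labels) is a hypothetical setup chosen only to illustrate the point:

```python
# Sketch showing why per-label accuracy misleads on sparse multi-label data,
# assuming scikit-learn is installed. The data is randomly generated.
import numpy as np
from sklearn.metrics import hamming_loss, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random((100, 20)) < 0.05).astype(int)  # ~5% of labels positive
y_pred = np.zeros_like(y_true)                       # trivial all-negative model

# Per-label accuracy (1 - Hamming loss) looks excellent...
print(1 - hamming_loss(y_true, y_pred))

# ...but the model never finds a single true label.
print(recall_score(y_true, y_pred, average="micro", zero_division=0))
```

This is why Hamming loss or accuracy alone is not enough: recall (and F1) exposes that the "accurate" model is useless.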