
Multi-label classification in ML Python - Model Metrics & Evaluation

Metrics & Evaluation - Multi-label classification
Which metric matters for Multi-label classification and WHY

In multi-label classification, each example can have many correct labels at once. So, we need metrics that check how well the model predicts all these labels together.

Common metrics are:

  • Hamming Loss: Measures how many labels are wrong on average. Lower is better.
  • Subset Accuracy: Checks if all predicted labels exactly match the true labels. Very strict.
  • Precision, Recall, and F1-score (micro and macro averaged): These show how well the model finds correct labels (Recall), avoids wrong labels (Precision), and balances both (F1).

We use these because simple accuracy doesn't work well when multiple labels can be true or false independently.
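The metrics above can be computed directly with scikit-learn. Below is a minimal sketch, assuming scikit-learn and NumPy are installed; the label matrices are made-up toy values, chosen only to illustrate the calls.

```python
import numpy as np
from sklearn.metrics import hamming_loss, accuracy_score, f1_score

# Toy multi-label data: rows are samples, columns are labels (illustrative values)
y_true = np.array([[1, 1, 0],
                   [0, 1, 1],
                   [1, 0, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 0, 1]])

hl = hamming_loss(y_true, y_pred)        # fraction of individual label slots that are wrong
subset = accuracy_score(y_true, y_pred)  # subset accuracy: rows that match exactly
micro = f1_score(y_true, y_pred, average='micro')  # pools TP/FP/FN across all labels
macro = f1_score(y_true, y_pred, average='macro')  # unweighted mean of per-label F1

print(f"Hamming loss:    {hl:.3f}")   # 2 wrong slots out of 9 = 0.222
print(f"Subset accuracy: {subset:.3f}")
print(f"Micro F1:        {micro:.3f}")
print(f"Macro F1:        {macro:.3f}")
```

Note that `accuracy_score` on a multi-label indicator matrix already computes subset accuracy, which is why it is so strict.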

Confusion matrix or equivalent visualization

For multi-label, confusion matrices are made per label. For example, for label A:

      |            | Predicted Yes  | Predicted No   |
      |------------|----------------|----------------|
      | Actual Yes | True Pos (TP)  | False Neg (FN) |
      | Actual No  | False Pos (FP) | True Neg (TN)  |
    

We calculate TP, FP, TN, FN for each label separately, then combine results for overall metrics.

Example for 3 labels (A, B, C) with 4 samples:

    Sample   True (A B C)   Predicted (A B C)
    1        1  0  1        1  0  0
    2        0  1  0        0  1  1
    3        1  1  0        1  0  0
    4        0  0  1        0  0  1

We count TP, FP, TN, FN for each label and then compute metrics.
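The per-label counting can be sketched in a few lines of NumPy, using the 4-sample, 3-label example above (rows are samples, columns are labels A, B, C):

```python
import numpy as np

# The 4-sample, 3-label example from the text
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 0, 0],
                   [0, 0, 1]])

counts = {}
for name, t, p in zip("ABC", y_true.T, y_pred.T):  # one column per label
    tp = int(np.sum((t == 1) & (p == 1)))
    fp = int(np.sum((t == 0) & (p == 1)))
    fn = int(np.sum((t == 1) & (p == 0)))
    tn = int(np.sum((t == 0) & (p == 0)))
    counts[name] = (tp, fp, fn, tn)
    print(f"Label {name}: TP={tp} FP={fp} FN={fn} TN={tn}")
```

From these counts, Hamming loss is the total of wrong slots (FP + FN over all labels) divided by samples × labels: here 3 / 12 = 0.25.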

Precision vs Recall tradeoff with concrete examples

In multi-label tasks, precision is the fraction of predicted labels that are actually correct, and recall is the fraction of true labels the model found.

Example: A music app tags songs with genres. If the model predicts many genres per song (high recall), it may include wrong ones (low precision). If it predicts fewer genres (high precision), it might miss some true genres (low recall).

Depending on the goal, you choose:

  • High precision: Avoid wrong tags, good for user trust.
  • High recall: Find all possible tags, good for discovery.

F1-score balances both.
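The tradeoff is easy to see by sweeping a decision threshold over predicted label scores. The genre scores below are hypothetical values invented for illustration; a low threshold tags many genres (recall up, precision down), a high threshold tags few (precision up, recall down):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical genre scores for 4 songs over 3 genres (all values made up)
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
scores = np.array([[0.9, 0.4, 0.6],
                   [0.2, 0.8, 0.5],
                   [0.7, 0.3, 0.1],
                   [0.1, 0.2, 0.9]])

results = {}
for thresh in (0.3, 0.7):  # low threshold tags many genres, high threshold tags few
    y_pred = (scores >= thresh).astype(int)
    p = precision_score(y_true, y_pred, average='micro', zero_division=0)
    r = recall_score(y_true, y_pred, average='micro', zero_division=0)
    results[thresh] = (p, r)
    print(f"threshold={thresh}: precision={p:.2f} recall={r:.2f}")
```

With these numbers, threshold 0.3 gives precision 0.75 and recall 1.00, while threshold 0.7 gives precision 1.00 and recall 0.67.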

What "good" vs "bad" metric values look like for multi-label classification

Good (rough guidelines; exact targets depend on the task and number of labels):

  • Hamming Loss close to 0 (few wrong labels)
  • Subset Accuracy above 0.7 (exact matches often)
  • Precision, Recall, F1 above 0.8 (balanced and strong)

Bad:

  • Hamming Loss near 0.5 or higher (many wrong labels)
  • Subset Accuracy near 0 (rare exact matches)
  • Precision or Recall below 0.5 (poor label prediction)

Remember, subset accuracy is strict and often low, so focus on F1 and Hamming Loss for practical insight.

Common pitfalls in multi-label classification metrics
  • Ignoring label imbalance: Some labels appear rarely. Macro averaging treats all labels equally, so rare labels can drag the score down; micro averaging weights by frequency, so failures on rare labels can be hidden.
  • Using accuracy alone: It can be misleading because predicting no labels can give high accuracy if most labels are negative.
  • Data leakage: If test data leaks into training, metrics look falsely good.
  • Overfitting: Very high training metrics but low test metrics mean the model memorizes instead of generalizing.
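The imbalance pitfall shows up clearly when micro and macro F1 are compared on a synthetic toy set where one label is rare and the model never predicts it (the data here is invented purely to demonstrate the gap):

```python
import numpy as np
from sklearn.metrics import f1_score

# Imbalanced toy set (synthetic): label 0 is common, label 1 is rare
y_true = np.array([[1, 0]] * 8 + [[1, 1]] * 2)
y_pred = np.array([[1, 0]] * 10)  # this model never predicts the rare label

micro = f1_score(y_true, y_pred, average='micro', zero_division=0)
macro = f1_score(y_true, y_pred, average='macro', zero_division=0)
print(f"Micro F1: {micro:.3f}")  # dominated by the common label, looks strong
print(f"Macro F1: {macro:.3f}")  # exposes the total failure on the rare label
```

Here micro F1 is about 0.909 while macro F1 is 0.5, because the per-label F1 for the rare label is 0. Reporting both averages guards against this blind spot.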
Self-check question

Your multi-label model has 98% accuracy but only 12% recall averaged over labels. Is it good for production? Why or why not?

Answer: No, it is not good. The high accuracy likely comes from predicting mostly negative labels correctly (many labels are absent). The very low recall means the model misses most true labels, so it fails to find what it should. This hurts usefulness in real tasks.
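The self-check scenario can be reproduced with sparse synthetic labels: a model that predicts no labels at all still scores very high per-slot accuracy while its recall is zero (all data below is randomly generated for illustration):

```python
import numpy as np

# Synthetic sparse labels: 100 samples x 10 labels, roughly 2% positives
rng = np.random.default_rng(0)
y_true = (rng.random((100, 10)) < 0.02).astype(int)
y_pred = np.zeros_like(y_true)  # a model that simply predicts "no labels"

accuracy = float((y_true == y_pred).mean())          # per-slot accuracy
tp = int(((y_true == 1) & (y_pred == 1)).sum())      # true labels it found: none
recall = tp / max(int(y_true.sum()), 1)
print(f"Accuracy: {accuracy:.3f}")  # very high, because most slots are truly 0
print(f"Recall:   {recall:.3f}")    # 0.0, the model never finds a true label
```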

Key Result
In multi-label classification, balanced metrics like F1-score and Hamming Loss best show model quality because simple accuracy can be misleading.