Semi-supervised learning trains on both labeled and unlabeled data. The right metric depends on the task: typically classification accuracy, precision, recall, or F1 score. We focus on metrics that show how well the model learns from limited labels and generalizes to new data. For example, if the goal is to find rare cases, recall is the priority; if avoiding false alarms matters, precision is. Accuracy alone can be misleading when classes are imbalanced.
Confusion matrix (labeled test set):

             Predicted Pos   Predicted Neg
Actual Pos       40 (TP)         10 (FN)
Actual Neg       15 (FP)         35 (TN)
Total samples = 40 + 10 + 15 + 35 = 100
Precision = TP / (TP + FP) = 40 / (40 + 15) = 0.727
Recall = TP / (TP + FN) = 40 / (40 + 10) = 0.8
F1 = 2 * (0.727 * 0.8) / (0.727 + 0.8) ≈ 0.762
Accuracy = (TP + TN) / Total = (40 + 35) / 100 = 0.75

In semi-supervised learning, the model may guess labels for unlabeled data. If it guesses too many positives, precision drops (more false alarms). If it guesses too few, recall drops (misses real positives).
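The arithmetic above can be checked with a few lines of plain Python, using the four counts from the confusion matrix:

```python
# Recompute the metrics above from the confusion-matrix counts.
tp, fn = 40, 10   # actual positives: correctly caught vs missed
fp, tn = 15, 35   # actual negatives: false alarms vs correctly rejected

total = tp + fn + fp + tn                  # 100 samples
precision = tp / (tp + fp)                 # 40 / 55
recall = tp / (tp + fn)                    # 40 / 50
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} accuracy={accuracy:.2f}")
# precision=0.727 recall=0.800 f1=0.762 accuracy=0.75
```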
Example 1: Detecting spam emails. High precision means few good emails marked as spam. Better to avoid false alarms, so precision matters more.
Example 2: Detecting diseases. High recall means catching most sick patients. Missing a sick patient is worse, so recall matters more.
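The pseudo-labeling idea mentioned above (the model guesses labels for unlabeled data, and over-eager guessing hurts precision) can be sketched in a few lines. The 1-D toy data, nearest-centroid "model", and margin threshold below are all invented for illustration:

```python
# Minimal self-training sketch: pseudo-label unlabeled points only when
# the model is confident, so ambiguous guesses do not pollute precision.

def centroids(xs, ys):
    """Mean of each class in the labeled data."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return sum(pos) / len(pos), sum(neg) / len(neg)

labeled_x = [0.0, 0.5, 4.5, 5.0]   # only four labeled samples
labeled_y = [0, 0, 1, 1]
unlabeled = [0.2, 0.3, 4.8, 2.5]   # 2.5 sits near the decision boundary

c_pos, c_neg = centroids(labeled_x, labeled_y)
for x in unlabeled:
    d_pos, d_neg = abs(x - c_pos), abs(x - c_neg)
    # Pseudo-label only if one centroid is clearly closer (margin > 1.0);
    # the ambiguous point 2.5 stays unlabeled to protect precision.
    if abs(d_pos - d_neg) > 1.0:
        labeled_x.append(x)
        labeled_y.append(1 if d_pos < d_neg else 0)

print(len(labeled_x))  # three confident points added, 2.5 skipped
```

Lowering the margin threshold adds more pseudo-labels (better recall, worse precision); raising it does the opposite, which is exactly the trade-off described above.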
Good: Balanced precision and recall above 0.7, F1 score above 0.7, accuracy reflecting true performance on labeled and unlabeled data.
Bad: High accuracy but very low recall or precision, indicating the model ignores minority classes or guesses poorly on unlabeled data.
- Accuracy paradox: High accuracy can hide poor performance on rare classes.
- Data leakage: Using unlabeled data incorrectly can leak test info, inflating metrics.
- Overfitting: Model fits labeled data too closely but fails on unlabeled data, causing misleading metrics.
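The accuracy paradox from the list above is easy to demonstrate: on an imbalanced toy set, a "model" that always predicts the majority class scores high accuracy while catching no positives at all. The 95/5 class split here is invented for illustration:

```python
# Accuracy paradox: 95 negatives, 5 positives, and a model that
# always predicts negative.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)

print(accuracy, recall)  # 0.95 accuracy, but 0.0 recall on positives
```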
Your semi-supervised model has 98% accuracy but only 12% recall on the positive class (rare cases). Is it good for production? Why or why not?
Answer: No. The model misses 88% of the positive cases (12% recall), which is unacceptable if those rare cases are the reason the model exists. The 98% accuracy is misleading: negatives dominate the data, so a model that labels almost everything negative still scores well.
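The exercise numbers can be made concrete. The counts below are one invented split (10,000 samples, 200 positives) consistent with 98% accuracy and 12% recall:

```python
# One set of counts matching the exercise: 98% accuracy, 12% recall.
tp, fn = 24, 176     # only 24 of 200 positives are caught
fp, tn = 24, 9776    # negatives dominate, so accuracy stays high

accuracy = (tp + tn) / (tp + fn + fp + tn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.98 recall=0.12 f1=0.19
```

Despite 98% accuracy, the F1 score of about 0.19 makes the failure on the rare class obvious.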