Semi-supervised learning uses both labeled and unlabeled data. The key metric depends on the task, often classification accuracy, precision, recall, or F1 score. We focus on metrics that show how well the model learns from limited labels and generalizes to new data. For example, if the goal is to find rare cases, recall is important. If avoiding false alarms matters, precision is key. Accuracy alone can be misleading if classes are imbalanced.
Semi-supervised learning basics in ML Python - Model Metrics & Evaluation
                  Predicted
                  Pos        Neg
Actual   Pos      40 (TP)    10 (FN)
         Neg      15 (FP)    35 (TN)
Total samples = 40 + 10 + 15 + 35 = 100
Precision = TP / (TP + FP) = 40 / (40 + 15) = 0.727
Recall = TP / (TP + FN) = 40 / (40 + 10) = 0.8
F1 = 2 * (0.727 * 0.8) / (0.727 + 0.8) ≈ 0.762
Accuracy = (TP + TN) / Total = (40 + 35) / 100 = 0.75
In semi-supervised learning, the model may guess labels for unlabeled data. If it guesses too many positives, precision drops (more false alarms). If it guesses too few, recall drops (misses real positives).
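The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch that uses only the four counts from the confusion matrix; the variable names are illustrative, and F1 is computed from the unrounded precision and recall.

```python
# Counts read off the confusion matrix above.
tp, fn = 40, 10   # actual positives: correctly found / missed
fp, tn = 15, 35   # actual negatives: false alarms / correctly rejected

precision = tp / (tp + fp)   # share of predicted positives that are real
recall = tp / (tp + fn)      # share of real positives that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} accuracy={accuracy:.3f}")
```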
Example 1: Detecting spam emails. High precision means few good emails marked as spam. Better to avoid false alarms, so precision matters more.
Example 2: Detecting diseases. High recall means catching most sick patients. Missing a sick patient is worse, so recall matters more.
Good: Balanced precision and recall above 0.7, F1 score above 0.7, accuracy reflecting true performance on labeled and unlabeled data.
Bad: High accuracy but very low recall or precision, indicating the model ignores minority classes or guesses poorly on unlabeled data.
- Accuracy paradox: High accuracy can hide poor performance on rare classes.
- Data leakage: Using unlabeled data incorrectly (for example, letting test samples into the unlabeled training pool) can leak test info, inflating metrics.
- Overfitting: Model fits labeled data too closely but fails on unlabeled data, causing misleading metrics.
Your semi-supervised model has 98% accuracy but only 12% recall on the positive class (rare cases). Is it good for production? Why or why not?
Answer: No, it is not good. The model misses most positive cases (low recall), which is critical if those cases matter. High accuracy is misleading because negatives dominate the data.
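To see how 98% accuracy and 12% recall can coexist, here is a small sketch with hypothetical counts. The question gives no dataset size; these numbers are assumptions chosen only to match the two percentages.

```python
# Hypothetical counts (assumed for illustration, not from the question).
tp, fn = 24, 176     # 200 real positives, only 24 of them found
fp, tn = 24, 9776    # 9,800 real negatives

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
recall = tp / (tp + fn)

# A baseline that always predicts "negative" gets every negative right
# and every positive wrong.
baseline_accuracy = (fp + tn) / total

print(f"model accuracy:    {accuracy:.2f}")
print(f"model recall:      {recall:.2f}")
print(f"baseline accuracy: {baseline_accuracy:.2f}")
```

With these counts, a model that never predicts the positive class would reach the same accuracy, which is why accuracy alone says little when negatives dominate.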
Practice
semi-supervised learning in machine learning?
Solution
Step 1: Understand the data types in semi-supervised learning
Semi-supervised learning uses a mix of labeled and unlabeled data to improve model training.
Step 2: Compare options with the definition
"Using both labeled and unlabeled data to train a model" correctly states the use of both labeled and unlabeled data, unlike the other options, which mention only one type or unrelated concepts.
Final Answer:
Using both labeled and unlabeled data to train a model -> Option C
Quick Check:
Semi-supervised learning = labeled + unlabeled data [OK]
- Confusing semi-supervised with supervised learning
- Thinking it uses only unlabeled data
- Assuming it trains multiple models separately
Solution
Step 1: Identify methods specific to semi-supervised learning
Self-training is a popular semi-supervised method where the model labels unlabeled data iteratively.
Step 2: Eliminate unrelated methods
Gradient boosting and decision trees are supervised learning methods; K-means is unsupervised clustering, not semi-supervised.
Final Answer:
Self-training -> Option A
Quick Check:
Semi-supervised method = Self-training [OK]
- Confusing supervised methods as semi-supervised
- Choosing clustering as semi-supervised
- Not knowing self-training meaning
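The iterative labeling that self-training performs can be sketched by hand. The example below is an illustration under assumptions, not a reference implementation: the synthetic dataset, the 0.9 confidence cutoff, the 10-round limit, and the choice of LogisticRegression are all picked for the demo.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for a real dataset (assumption).
X, y_true = make_classification(n_samples=500, n_features=5, random_state=0)

# Keep 30 labels and mark everything else as unlabeled with -1.
rng = np.random.RandomState(0)
y = np.full_like(y_true, -1)
labeled_idx = rng.choice(len(y_true), size=30, replace=False)
y[labeled_idx] = y_true[labeled_idx]

threshold = 0.9  # hypothetical confidence cutoff for accepting a pseudo-label
model = LogisticRegression()

for _ in range(10):  # hypothetical cap on the number of rounds
    labeled = y != -1
    model.fit(X[labeled], y[labeled])
    if labeled.all():
        break
    proba = model.predict_proba(X[~labeled])
    confident = proba.max(axis=1) >= threshold
    if not confident.any():
        break
    # Adopt only the confident predictions as pseudo-labels.
    unlabeled_idx = np.flatnonzero(~labeled)
    y[unlabeled_idx[confident]] = model.classes_[proba.argmax(axis=1)[confident]]

n_pseudo = int((y != -1).sum()) - len(labeled_idx)
print(f"pseudo-labeled samples: {n_pseudo}")
```

Each round retrains on the true labels plus the pseudo-labels accepted so far, so a wrong pseudo-label stays in the training set. That is one reason to keep the cutoff high and to judge the result on held-out labeled data.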
    from sklearn.semi_supervised import LabelSpreading
    import numpy as np

    X = np.array([[1], [2], [3], [4], [5]])
    y = np.array([0, 1, -1, -1, -1])  # -1 means unlabeled

    model = LabelSpreading()
    model.fit(X, y)
    preds = model.transduction_
    print(preds)

What will be the output printed by print(preds)?
Solution
Step 1: Understand label spreading behavior
Label spreading propagates labels from labeled points (0 and 1) to unlabeled points (-1) based on similarity.
Step 2: Predict labels for unlabeled points
The unlabeled points at indices 2, 3, and 4 (values 3, 4, 5) are closer to the labeled point at index 1 (value 2, label 1) than to the point at index 0, so they receive label 1. The points at indices 0 and 1 keep their labels 0 and 1.
Final Answer:
[0 1 1 1 1] -> Option D
Quick Check:
Label spreading fills unlabeled with nearest labels [OK]
- Assuming unlabeled points remain -1
- Thinking labels spread to 0 instead of 1
- Confusing output with input labels
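To connect label spreading back to the metrics, the sketch below hides most labels of a synthetic dataset, fits LabelSpreading with its default settings, and scores the inferred labels on the points that were hidden. The dataset (make_moons) and the choice of 10 labels per class are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics import classification_report
from sklearn.semi_supervised import LabelSpreading

# Synthetic two-class data (assumption); y_true is kept only for scoring.
X, y_true = make_moons(n_samples=300, noise=0.1, random_state=0)

# Keep 10 labels per class and hide the rest with -1.
rng = np.random.RandomState(0)
y = np.full_like(y_true, -1)
for cls in (0, 1):
    keep = rng.choice(np.flatnonzero(y_true == cls), size=10, replace=False)
    y[keep] = y_true[keep]

model = LabelSpreading()
model.fit(X, y)

# Score only the points whose labels were hidden during fitting.
unlabeled = y == -1
report = classification_report(y_true[unlabeled], model.transduction_[unlabeled])
print(report)
```

The report lists precision, recall, and F1 per class, so a class that the propagation handles poorly shows up even when overall accuracy looks fine.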
    from sklearn.semi_supervised import SelfTrainingClassifier
    from sklearn.svm import SVC

    X = [[1], [2], [3], [4]]
    y = [0, 1, -1, -1]

    base_model = SVC()
    model = SelfTrainingClassifier(base_model)
    model.fit(X, y)

What is the error in this code?
Solution
Step 1: Check requirements for SelfTrainingClassifier base model
SelfTrainingClassifier needs the base model to provide probability estimates, so SVC must be initialized with probability=True.
Step 2: Identify the missing argument
The code uses the default SVC without probability=True, causing an error during fit.
Final Answer:
SVC requires probability=True for self-training -> Option B
Quick Check:
SelfTrainingClassifier needs probabilistic base model [OK]
- Thinking -1 labels are invalid
- Believing lists can't be used as input
- Assuming SVC can't be base model
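A corrected version might look like the sketch below. It swaps the quiz's four points for a larger synthetic dataset (an assumption), because SVC's probability estimates rely on internal cross-validation and are not meaningful with two labeled samples.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

# Synthetic two-cluster data (assumption), larger than the quiz's four points.
X, y_true = make_blobs(n_samples=200, centers=2, random_state=0)

# Keep 20 labels per class and mark the rest as unlabeled with -1.
rng = np.random.RandomState(0)
y = np.full_like(y_true, -1)
for cls in (0, 1):
    keep = rng.choice(np.flatnonzero(y_true == cls), size=20, replace=False)
    y[keep] = y_true[keep]

# probability=True makes SVC expose predict_proba, which self-training needs.
base_model = SVC(probability=True, random_state=0)
model = SelfTrainingClassifier(base_model)
model.fit(X, y)

preds = model.predict(X)
print(preds[:10])
```

Any classifier that implements predict_proba can serve as the base model; SVC is kept here only to mirror the quiz.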
Solution
Step 1: Understand the problem with few labeled samples
With only 50 labeled samples, training a model directly may not generalize well.
Step 2: Choose a semi-supervised method to leverage unlabeled data
Self-training uses the base classifier to label unlabeled data iteratively, improving learning without costly manual labeling.
Final Answer:
Use self-training with a base classifier that predicts labels on unlabeled data iteratively -> Option A
Quick Check:
Semi-supervised learning improves with self-training on unlabeled data [OK]
- Ignoring unlabeled data wastes valuable information
- Assuming manual labeling is always feasible
- Confusing clustering with semi-supervised learning
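Whether self-training actually helps with 50 labels is an empirical question, so it is worth comparing against a supervised baseline on a held-out test set. The sketch below shows one way to set that up; the synthetic dataset, the LogisticRegression base model, and the 0.9 threshold are assumptions, and no particular outcome is implied.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic data standing in for the real problem (assumption).
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Split off the test set first so it never enters the unlabeled pool.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Keep 25 labels per class (50 total); hide the rest with -1.
rng = np.random.RandomState(0)
y_semi = np.full_like(y_train, -1)
for cls in (0, 1):
    keep = rng.choice(np.flatnonzero(y_train == cls), size=25, replace=False)
    y_semi[keep] = y_train[keep]
labeled = y_semi != -1

# Baseline: supervised model trained on the 50 labeled samples only.
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train[labeled], y_train[labeled])

# Self-training: same base model, also given the unlabeled samples.
self_trained = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
self_trained.fit(X_train, y_semi)

results = {}
for name, clf in [("baseline", baseline), ("self-training", self_trained)]:
    pred = clf.predict(X_test)
    results[name] = (recall_score(y_test, pred), f1_score(y_test, pred))
    print(f"{name}: recall={results[name][0]:.3f} f1={results[name][1]:.3f}")
```

Splitting before masking labels keeps test samples out of the unlabeled pool, which avoids the data leakage pitfall listed earlier.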
