Semi-supervised learning basics in ML Python

Introduction
Semi-supervised learning trains a model on a small amount of labeled data together with a large amount of unlabeled data, which reduces the cost of labeling.
When you have a few labeled photos but many unlabeled ones and want to teach a computer to recognize objects.
When labeling data is expensive or slow, like medical images, but you have many unlabeled examples.
When you want to improve a model's accuracy by using extra unlabeled data alongside labeled data.
When you want to build a spam filter but only have a small set of emails marked as spam or not.
When you want to cluster or group data but also have some known examples to guide the process.
Syntax
ML Python
1. Start with a small labeled dataset and a large unlabeled dataset.
2. Train a model on the labeled data.
3. Use the model to predict labels for the unlabeled data.
4. Add the most confident predictions to the labeled set.
5. Retrain the model with the expanded labeled set.
6. Repeat steps 3-5 until satisfied.
This process is often called self-training or pseudo-labeling.
Confidence means how sure the model is about its prediction, usually measured as the highest predicted class probability.
Examples
This shows the basic loop of training, predicting, selecting confident predictions, and retraining.
ML Python
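# Example of pseudo-labeling steps (one round; the dataset, model, and 0.95 threshold are illustrative choices)
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
X_labeled, y_labeled, X_unlabeled = X[:100], y[:100], X[100:]  # small labeled set, large unlabeled set
model = LogisticRegression(max_iter=5000).fit(X_labeled, y_labeled)  # train on labeled data
probabilities = model.predict_proba(X_unlabeled)  # predict on unlabeled data
confident = probabilities.max(axis=1) >= 0.95  # keep only confident predictions
pseudo_labels = model.classes_[probabilities.argmax(axis=1)]
X_labeled = np.vstack([X_labeled, X_unlabeled[confident]])  # add them to the labeled set
y_labeled = np.concatenate([y_labeled, pseudo_labels[confident]])
model.fit(X_labeled, y_labeled)  # retrain on the expanded labeled set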
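LabelSpreading is a semi-supervised method built into scikit-learn that spreads label information from labeled points to nearby unlabeled points.
ML Python
# Using scikit-learn LabelSpreading
from sklearn.semi_supervised import LabelSpreading
model = LabelSpreading()
# X_train: feature matrix; y_train_with_missing_labels: labels with -1 for unlabeled samples
model.fit(X_train, y_train_with_missing_labels)
# transduction_ holds the label assigned to every training sample
predicted_labels = model.transduction_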
Sample Model
This program uses a real dataset of handwritten digits. It hides about half the labels to simulate unlabeled data, trains a semi-supervised model to infer the missing labels, and then checks accuracy on the samples whose labels were hidden.
ML Python
from sklearn import datasets
from sklearn.semi_supervised import LabelSpreading
from sklearn.metrics import accuracy_score
import numpy as np

# Load digits dataset
digits = datasets.load_digits()
X = digits.data
y = digits.target

# Remove labels for about 50% of the data to simulate unlabeled data
rng = np.random.RandomState(42)
random_unlabeled_points = rng.rand(len(y)) < 0.5
y_missing = np.copy(y)
y_missing[random_unlabeled_points] = -1  # -1 means unlabeled

# Create and train LabelSpreading model
model = LabelSpreading(kernel='knn', alpha=0.8)
model.fit(X, y_missing)

# Labels inferred for every training sample, labeled or not
y_pred = model.transduction_

# Calculate accuracy on the points whose labels were hidden,
# since those are the labels the model actually had to infer
accuracy = accuracy_score(y[random_unlabeled_points], y_pred[random_unlabeled_points])
print(f"Accuracy on hidden labels: {accuracy:.2f}")
Important Notes
Semi-supervised learning works best when the unlabeled data comes from the same distribution as the labeled data.
Be careful: wrong pseudo-labels are treated as true labels in later rounds, so early mistakes can reinforce themselves.
Choosing a good confidence threshold for adding pseudo-labels is important.
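To make the threshold trade-off concrete, here is a minimal sketch that fits scikit-learn's SelfTrainingClassifier at a few thresholds and counts how many pseudo-labels were added and how many of them disagree with the true labels. The digits dataset, the LogisticRegression base model, the 90% hiding rate, and the three threshold values are all assumptions chosen for illustration, not recommendations.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_digits(return_X_y=True)
X = X / 16.0  # pixel values range from 0 to 16; scale them to [0, 1]

# Hide roughly 90% of the labels (-1 marks an unlabeled sample)
rng = np.random.RandomState(0)
hidden = rng.rand(len(y)) < 0.9
y_partial = np.where(hidden, -1, y)

results = {}
for threshold in (0.6, 0.8, 0.95):
    clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=threshold)
    clf.fit(X, y_partial)
    # labeled_iter_ > 0 marks samples that received a pseudo-label during fit
    added = clf.labeled_iter_ > 0
    n_added = int(added.sum())
    # transduction_ stores the labels used for the final fit, pseudo-labels included
    error_rate = float((clf.transduction_[added] != y[added]).mean()) if n_added else 0.0
    results[threshold] = (n_added, error_rate)
    print(f"threshold={threshold}: {n_added} pseudo-labels added, {error_rate:.1%} of them wrong")
```

Comparing against the true labels is only possible here because the labels were hidden artificially. On real unlabeled data, a held-out labeled validation set is the usual way to judge a threshold.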
Summary
Semi-supervised learning uses both labeled and unlabeled data to improve learning.
It is useful when labeling data is costly or time-consuming.
Common methods include self-training and label spreading.

Practice

1. What is the main idea behind semi-supervised learning in machine learning?
easy
A. Using only unlabeled data to train a model
B. Using only labeled data to train a model
C. Using both labeled and unlabeled data to train a model
D. Training multiple models independently

Solution

  1. Step 1: Understand the data types in semi-supervised learning

    Semi-supervised learning uses a mix of labeled and unlabeled data to improve model training.
  2. Step 2: Compare options with the definition

    Option C is the only one that mentions both labeled and unlabeled data. Options A and B each use only one type of data, and option D describes an unrelated idea.
  3. Final Answer:

    Using both labeled and unlabeled data to train a model -> Option C
  4. Quick Check:

    Semi-supervised learning = labeled + unlabeled data [OK]
Hint: Remember: semi-supervised = mix of labeled and unlabeled [OK]
Common Mistakes:
  • Confusing semi-supervised with supervised learning
  • Thinking it uses only unlabeled data
  • Assuming it trains multiple models separately
2. Which of the following is a common method used in semi-supervised learning?
easy
A. Self-training
B. Gradient boosting
C. K-means clustering
D. Decision trees

Solution

  1. Step 1: Identify methods specific to semi-supervised learning

    Self-training is a popular semi-supervised method where the model labels unlabeled data iteratively.
  2. Step 2: Eliminate unrelated methods

    Gradient boosting and decision trees are supervised learning methods; K-means is unsupervised clustering, not semi-supervised.
  3. Final Answer:

    Self-training -> Option A
  4. Quick Check:

    Semi-supervised method = Self-training [OK]
Hint: Look for methods that use the model to label unlabeled data [OK]
Common Mistakes:
  • Mistaking supervised methods for semi-supervised ones
  • Choosing clustering as semi-supervised
  • Not knowing what self-training means
3. Consider this Python snippet using label spreading for semi-supervised learning:
from sklearn.semi_supervised import LabelSpreading
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([0, 1, -1, -1, -1])  # -1 means unlabeled

model = LabelSpreading()
model.fit(X, y)
preds = model.transduction_
print(preds)
What will be the output printed by print(preds)?
medium
A. [0 1 0 0 0]
B. [1 1 1 1 1]
C. [0 1 -1 -1 -1]
D. [0 1 1 1 1]

Solution

  1. Step 1: Understand label spreading behavior

    Label spreading propagates labels from labeled points (0 and 1) to unlabeled points (-1) based on similarity.
  2. Step 2: Predict labels for unlabeled points

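    The unlabeled samples at indices 2, 3, and 4 (x = 3, 4, 5) lie beyond the sample labeled 1 (index 1, x = 2), and label information from the sample labeled 0 (index 0, x = 1) can only reach them by passing through it, so they get label 1. The labeled samples at indices 0 and 1 keep their labels 0 and 1.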
  3. Final Answer:

    [0 1 1 1 1] -> Option D
  4. Quick Check:

    Label spreading fills unlabeled with nearest labels [OK]
Hint: Label spreading fills unlabeled with nearest known labels [OK]
Common Mistakes:
  • Assuming unlabeled points remain -1
  • Expecting the unlabeled points to receive label 0 instead of 1
  • Confusing output with input labels
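One way to check the reasoning in question 3 is to run the snippet and look at the per-class scores behind each prediction. The sketch below reuses the question's data (X is stored as floats here, an assumption made only to keep the dtype explicit) and prints label_distributions_, the normalized class scores that transduction_ is derived from.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

X = np.array([[1], [2], [3], [4], [5]], dtype=float)
y = np.array([0, 1, -1, -1, -1])  # -1 means unlabeled

model = LabelSpreading()  # default settings, as in the question
model.fit(X, y)

preds = model.transduction_          # final label for each training sample
probs = model.label_distributions_   # one row per sample, one column per class

for x_value, given, pred, row in zip(X.ravel(), y, preds, probs):
    print(f"x={x_value:.0f} given={given:2d} predicted={pred} scores={np.round(row, 3)}")
```

The predicted column never contains -1, because every sample is assigned one of the known classes.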
4. The following code attempts to use self-training but has an error:
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X = [[1], [2], [3], [4]]
y = [0, 1, -1, -1]

base_model = SVC()
model = SelfTrainingClassifier(base_model)
model.fit(X, y)
What is the error in this code?
medium
A. Labels cannot contain -1 for unlabeled data
B. SVC requires probability=True for self-training
C. X must be a numpy array, not a list
D. SelfTrainingClassifier cannot use SVC as base model

Solution

  1. Step 1: Check requirements for SelfTrainingClassifier base model

    SelfTrainingClassifier picks pseudo-labels from the base model's predict_proba output, so the base model must provide probability estimates. SVC only exposes predict_proba when it is created with probability=True.
  2. Step 2: Identify the missing argument

    The code uses default SVC without probability=True, causing an error during fit.
  3. Final Answer:

    SVC requires probability=True for self-training -> Option B
  4. Quick Check:

    SelfTrainingClassifier needs probabilistic base model [OK]
Hint: Remember: SVC needs probability=True for self-training [OK]
Common Mistakes:
  • Thinking -1 labels are invalid
  • Believing lists can't be used as input
  • Assuming SVC can't be base model
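A quick way to see the cause of the error in question 4, without fitting anything, is to check whether each SVC variant exposes predict_proba. This is a small sketch; the variable names are illustrative.

```python
from sklearn.svm import SVC
from sklearn.semi_supervised import SelfTrainingClassifier

default_svc = SVC()                  # probability=False by default
proba_svc = SVC(probability=True)    # enables probability estimates

# predict_proba is only available when probability=True
default_has_proba = hasattr(default_svc, "predict_proba")
proba_has_proba = hasattr(proba_svc, "predict_proba")
print(default_has_proba, proba_has_proba)

# The corrected wrapper from the question uses the probability-enabled model
model = SelfTrainingClassifier(proba_svc)
```

Any other classifier with predict_proba, such as LogisticRegression, can also serve as the base model.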
5. You have a dataset with 1000 samples but only 50 are labeled. You want to improve model accuracy using semi-supervised learning. Which approach is best to start with?
hard
A. Use self-training with a base classifier that predicts labels on unlabeled data iteratively
B. Ignore unlabeled data and train only on 50 labeled samples
C. Use unsupervised clustering to label all data without any model
D. Label all 950 samples manually before training

Solution

  1. Step 1: Understand the problem with few labeled samples

    With only 50 labeled samples, training a model directly may not generalize well.
  2. Step 2: Choose a semi-supervised method to leverage unlabeled data

    Self-training uses the base classifier to label unlabeled data iteratively, improving learning without costly manual labeling.
  3. Final Answer:

    Use self-training with a base classifier that predicts labels on unlabeled data iteratively -> Option A
  4. Quick Check:

    Semi-supervised learning improves with self-training on unlabeled data [OK]
Hint: Start with self-training to use unlabeled data effectively [OK]
Common Mistakes:
  • Ignoring unlabeled data wastes valuable information
  • Assuming manual labeling is always feasible
  • Confusing clustering with semi-supervised learning
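The scenario in question 5 can be tried out with a short sketch that compares a supervised baseline trained on the 50 labeled samples against self-training on all 1000. The synthetic make_classification data, the LogisticRegression base model, and the 0.9 threshold are assumptions standing in for a real dataset, and self-training is not guaranteed to beat the baseline on every problem.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic stand-in for the scenario: 1000 samples, only 50 labeled
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=0)
rng = np.random.RandomState(0)
is_labeled = np.zeros(len(y), dtype=bool)
is_labeled[rng.choice(len(y), size=50, replace=False)] = True
y_partial = np.where(is_labeled, y, -1)  # -1 marks the 950 unlabeled samples

# Option B: ignore the unlabeled data
baseline = LogisticRegression(max_iter=1000).fit(X[is_labeled], y[is_labeled])

# Option A: self-training on labeled plus unlabeled data
self_trained = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
self_trained.fit(X, y_partial)

# Score both on the 950 samples whose true labels were withheld
baseline_acc = accuracy_score(y[~is_labeled], baseline.predict(X[~is_labeled]))
self_trained_acc = accuracy_score(y[~is_labeled], self_trained.predict(X[~is_labeled]))
print(f"baseline: {baseline_acc:.3f}  self-training: {self_trained_acc:.3f}")
```

Because the self-trained model saw the unlabeled features during fitting, this comparison measures how well each approach labels the existing pool. A separate held-out test set would be needed to compare generalization to new data.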