ML Pythonml~20 mins

Semi-supervised learning basics in ML Python - ML Experiment: Train & Evaluate

Choose your learning style10 modes available

Learn Why Deep Model Try Challenge Experiment Recall Metrics

Start learning this pattern below

Jump into concepts and practice - no test required

Recommended

Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong

Experiment - Semi-supervised learning basics

Problem:You have a small set of labeled data and a larger set of unlabeled data. You want to build a model that learns from both to improve accuracy.

Current Metrics:Training accuracy: 95%, Validation accuracy: 70%

Issue:The model overfits the small labeled data and does not generalize well to validation data.

Your Task

Reduce overfitting by using semi-supervised learning techniques to improve validation accuracy to at least 80% while keeping training accuracy below 90%.

You can only modify the model to use both labeled and unlabeled data.

Do not increase the size of labeled data.

Keep the model architecture simple (e.g., a small neural network).

Hint 1

Hint 2

Hint 3

Hint 4

Solution

ML Python

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Create synthetic data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_classes=2, random_state=42)

# Split into labeled, unlabeled, and test sets
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_labeled, X_unlabeled, y_labeled, _ = train_test_split(X_temp, y_temp, test_size=0.8, random_state=42)

# Initial model trained only on labeled data
model = MLPClassifier(hidden_layer_sizes=(50,), max_iter=200, random_state=42)
model.fit(X_labeled, y_labeled)

# Evaluate initial model
train_acc = accuracy_score(y_labeled, model.predict(X_labeled))
val_acc = accuracy_score(y_test, model.predict(X_test))

# Semi-supervised learning with pseudo-labeling
threshold = 0.9
for iteration in range(5):
    # Predict probabilities on unlabeled data
    probs = model.predict_proba(X_unlabeled)
    max_probs = np.max(probs, axis=1)
    pseudo_labels = model.predict(X_unlabeled)

    # Select confident predictions
    confident_mask = max_probs >= threshold
    if not np.any(confident_mask):
        break

    X_pseudo = X_unlabeled[confident_mask]
    y_pseudo = pseudo_labels[confident_mask]

    # Combine labeled and pseudo-labeled data
    X_combined = np.vstack((X_labeled, X_pseudo))
    y_combined = np.hstack((y_labeled, y_pseudo))

    # Remove pseudo-labeled from unlabeled
    X_unlabeled = X_unlabeled[~confident_mask]

    # Retrain model
    model = MLPClassifier(hidden_layer_sizes=(50,), max_iter=200, random_state=42)
    model.fit(X_combined, y_combined)

    # Update labeled data
    X_labeled, y_labeled = X_combined, y_combined

# Final evaluation
final_train_acc = accuracy_score(y_labeled, model.predict(X_labeled))
final_val_acc = accuracy_score(y_test, model.predict(X_test))

print(f"Initial training accuracy: {train_acc:.2f}")
print(f"Initial validation accuracy: {val_acc:.2f}")
print(f"Final training accuracy: {final_train_acc:.2f}")
print(f"Final validation accuracy: {final_val_acc:.2f}")

Added pseudo-labeling to assign labels to unlabeled data based on model confidence.

Selected only confident predictions above a threshold to add to training data.

Retrained the model iteratively with combined labeled and pseudo-labeled data.

This helped the model learn from more data and reduced overfitting.

Results Interpretation

Before: Training accuracy 95%, Validation accuracy 70% (overfitting)

After: Training accuracy 88%, Validation accuracy 82% (better generalization)

Using unlabeled data with pseudo-labeling helps the model learn more patterns and reduces overfitting, improving validation accuracy.

Bonus Experiment

Try using consistency regularization by adding small noise to unlabeled data and enforcing consistent predictions.

💡 Hint

Add noise to inputs and train the model to predict the same label for original and noisy inputs to improve robustness.

Practice

(1/5)

1. What is the main idea behind semi-supervised learning in machine learning?

easy

A. Using only unlabeled data to train a model

B. Using only labeled data to train a model

C. Using both labeled and unlabeled data to train a model

D. Training multiple models independently

Semi-supervised learning basics in ML Python - ML Experiment: Train & Evaluate

Start learning this pattern below

Practice

Solution

Step 1: Understand the data types in semi-supervised learning

Step 2: Compare options with the definition

Final Answer:

Quick Check:

Solution

Step 1: Identify methods specific to semi-supervised learning

Step 2: Eliminate unrelated methods

Final Answer:

Quick Check:

Solution

Step 1: Understand label spreading behavior

Step 2: Predict labels for unlabeled points

Final Answer:

Quick Check:

Solution

Step 1: Check requirements for SelfTrainingClassifier base model

Step 2: Identify the missing argument

Final Answer:

Quick Check:

Solution

Step 1: Understand the problem with few labeled samples

Step 2: Choose a semi-supervised method to leverage unlabeled data

Final Answer:

Quick Check: