Introduction
Semi-supervised learning trains a model on a small amount of labeled data together with a large amount of unlabeled data, which reduces the cost and effort of labeling.
1. Start with a small labeled dataset and a large unlabeled dataset.
2. Train a model on the labeled data.
3. Use the model to predict labels for the unlabeled data.
4. Add the most confident predictions to the labeled set.
5. Retrain the model with the expanded labeled set.
6. Repeat steps 3-5 until satisfied.
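The steps above can be sketched as runnable code. This is a minimal illustration rather than a reference implementation: the digits dataset, the `LogisticRegression` base model, the 10% labeled fraction, the 0.95 confidence threshold, and the cap of 5 rounds are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

# Step 1: a small labeled set and a large unlabeled set (assumed 10% / 90% split).
X, y = load_digits(return_X_y=True)
X = X / 16.0  # scale pixel values to [0, 1]
rng = np.random.RandomState(0)
labeled_mask = rng.rand(len(y)) < 0.1

X_labeled, y_labeled = X[labeled_mask], y[labeled_mask]
X_unlabeled = X[~labeled_mask]

threshold = 0.95  # assumed cutoff for "most confident" predictions
max_rounds = 5    # assumed stopping rule for "until satisfied"

model = LogisticRegression(max_iter=1000)
for _ in range(max_rounds):
    # Steps 2 and 5: (re)train on the current labeled set.
    model.fit(X_labeled, y_labeled)
    if len(X_unlabeled) == 0:
        break

    # Step 3: predict class probabilities for the unlabeled points.
    proba = model.predict_proba(X_unlabeled)
    pseudo_labels = model.classes_[proba.argmax(axis=1)]

    # Step 4: keep only predictions at or above the confidence threshold.
    confident = proba.max(axis=1) >= threshold
    if not confident.any():
        break

    X_labeled = np.vstack([X_labeled, X_unlabeled[confident]])
    y_labeled = np.concatenate([y_labeled, pseudo_labels[confident]])
    X_unlabeled = X_unlabeled[~confident]

# Final retrain so the model reflects the last batch of pseudo-labels.
model.fit(X_labeled, y_labeled)
print(f"Labeled set size after pseudo-labeling: {len(y_labeled)}")
```

A high threshold adds fewer pseudo-labels per round but lowers the risk of training on wrong labels; a lower threshold grows the labeled set faster at the cost of more label noise.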
# Example of pseudo-labeling steps
labeled_data = small_labeled_set
unlabeled_data = large_unlabeled_set
model.train(labeled_data)
predictions = model.predict(unlabeled_data)
confident_preds = select_confident(predictions)
labeled_data += confident_preds
model.train(labeled_data)

# Using scikit-learn LabelSpreading
from sklearn.semi_supervised import LabelSpreading
model = LabelSpreading()
model.fit(X_train, y_train_with_missing_labels)
predicted_labels = model.transduction_
from sklearn import datasets
from sklearn.semi_supervised import LabelSpreading
from sklearn.metrics import accuracy_score
import numpy as np

# Load digits dataset
digits = datasets.load_digits()
X = digits.data
y = digits.target

# Remove labels for 50% of data to simulate unlabeled data
rng = np.random.RandomState(42)
random_unlabeled_points = rng.rand(len(y)) < 0.5
y_missing = np.copy(y)
y_missing[random_unlabeled_points] = -1  # -1 means unlabeled

# Create and train LabelSpreading model
model = LabelSpreading(kernel='knn', alpha=0.8)
model.fit(X, y_missing)

# Predict labels for all data
y_pred = model.transduction_

# Calculate accuracy only on originally labeled points
accuracy = accuracy_score(y[~random_unlabeled_points], y_pred[~random_unlabeled_points])
print(f"Accuracy on labeled data: {accuracy:.2f}")
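The example above scores the model on points whose labels it was given. A complementary check, sketched below with the same dataset, seed, and parameters, is to score it on the points whose labels were hidden, since those are the ones the model had to infer. The names `hidden` and `hidden_accuracy` are introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelSpreading

digits = datasets.load_digits()
X, y = digits.data, digits.target

# Hide the labels of roughly half the points, as in the example above.
rng = np.random.RandomState(42)
hidden = rng.rand(len(y)) < 0.5
y_missing = np.copy(y)
y_missing[hidden] = -1  # -1 means unlabeled

model = LabelSpreading(kernel="knn", alpha=0.8)
model.fit(X, y_missing)

# Compare the inferred labels with the true labels that were withheld.
hidden_accuracy = accuracy_score(y[hidden], model.transduction_[hidden])
print(f"Accuracy on hidden labels: {hidden_accuracy:.2f}")
```

This is only possible here because the true labels were removed artificially; with genuinely unlabeled data a separate labeled validation set would be needed.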
semi-supervised learning in machine learning?

from sklearn.semi_supervised import LabelSpreading
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([0, 1, -1, -1, -1])  # -1 means unlabeled

model = LabelSpreading()
model.fit(X, y)
preds = model.transduction_
print(preds)

What will be the output printed by print(preds)?

from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X = [[1], [2], [3], [4]]
y = [0, 1, -1, -1]

base_model = SVC()
model = SelfTrainingClassifier(base_model)
model.fit(X, y)

What is the error in this code?
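One likely answer to the last question: `SelfTrainingClassifier` needs a base estimator that implements `predict_proba`, and `SVC()` does not expose it unless it is created with `probability=True`. The sketch below shows that fix. It swaps the four-point toy data for the iris dataset with about half the labels hidden, which is an assumption made here so that the probability calibration inside `SVC` has enough labeled samples to work with.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hide roughly half of the labels; -1 means unlabeled.
rng = np.random.RandomState(0)
y_partial = y.copy()
y_partial[rng.rand(len(y)) < 0.5] = -1

# probability=True makes SVC provide predict_proba,
# which the self-training wrapper uses to pick confident predictions.
base_model = SVC(probability=True, random_state=0)
model = SelfTrainingClassifier(base_model)
model.fit(X, y_partial)

predictions = model.predict(X)
print(predictions[:10])
```

Any classifier with `predict_proba`, such as `LogisticRegression`, could be used as the base estimator instead.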