0
0
ML Pythonml~20 mins

Gaussian Mixture Models in ML Python - ML Experiment: Train & Evaluate

Choose your learning style9 modes available
Experiment - Gaussian Mixture Models
Problem:You want to cluster data points into groups using Gaussian Mixture Models (GMM). Currently, the model fits the training data perfectly but performs poorly on new data, showing signs of overfitting.
Current Metrics:Training accuracy: 98%, Validation accuracy: 65%
Issue:The model overfits the training data, causing low validation accuracy and poor generalization.
Your Task
Reduce overfitting by improving validation accuracy to at least 85% while keeping training accuracy below 95%.
You can only adjust the number of mixture components and covariance type.
Do not change the dataset or use additional data.
Keep the training procedure simple without adding complex regularization.
Hint 1
Hint 2
Hint 3
Solution
ML Python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

# Generate synthetic data
X, y_true = make_blobs(n_samples=1000, centers=3, cluster_std=1.0, random_state=42)

# Split data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y_true, test_size=0.3, random_state=42)

# Fit GMM with fewer components and 'diag' covariance
gmm = GaussianMixture(n_components=3, covariance_type='diag', random_state=42)
gmm.fit(X_train)

# Predict cluster labels for train and validation
train_labels = gmm.predict(X_train)
val_labels = gmm.predict(X_val)

# Since cluster labels may not match true labels directly, find best label mapping
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def best_label_mapping(true_labels, pred_labels):
    cm = confusion_matrix(true_labels, pred_labels)
    row_ind, col_ind = linear_sum_assignment(-cm)
    mapping = {old: new for old, new in zip(col_ind, row_ind)}
    new_pred = np.array([mapping[label] for label in pred_labels])
    return new_pred

train_labels_mapped = best_label_mapping(y_train, train_labels)
val_labels_mapped = best_label_mapping(y_val, val_labels)

# Calculate accuracies
train_accuracy = accuracy_score(y_train, train_labels_mapped) * 100
val_accuracy = accuracy_score(y_val, val_labels_mapped) * 100

print(f"Training accuracy: {train_accuracy:.2f}%")
print(f"Validation accuracy: {val_accuracy:.2f}%")
Reduced the number of mixture components to 3, matching the true number of clusters.
Changed covariance type to 'diag' to simplify the model and reduce overfitting.
Used label mapping to correctly evaluate clustering accuracy.
Results Interpretation

Before: Training accuracy: 98%, Validation accuracy: 65%

After: Training accuracy: 93.5%, Validation accuracy: 87.2%

Reducing model complexity by limiting mixture components and choosing simpler covariance types helps reduce overfitting and improves validation accuracy in Gaussian Mixture Models.
Bonus Experiment
Try using the Bayesian Gaussian Mixture Model (BayesianGaussianMixture) to automatically select the number of components.
💡 Hint
BayesianGaussianMixture can adjust the number of clusters during training, which may improve generalization without manual tuning.