Gaussian Mixture Models in ML Python - ML Experiment: Train & Evaluate

Experiment - Gaussian Mixture Models
Problem: You want to cluster data points into groups using Gaussian Mixture Models (GMM). Currently, the model fits the training data very closely but performs poorly on new data, a sign of overfitting.
Current Metrics: Training accuracy: 98%, Validation accuracy: 65%
Issue: The model overfits the training data, causing low validation accuracy and poor generalization.
Your Task
Reduce overfitting by improving validation accuracy to at least 85% while keeping training accuracy below 95%.
You can only adjust the number of mixture components and covariance type.
Do not change the dataset or use additional data.
Keep the training procedure simple without adding complex regularization.
Solution
ML Python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

# Generate synthetic data
X, y_true = make_blobs(n_samples=1000, centers=3, cluster_std=1.0, random_state=42)

# Split data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y_true, test_size=0.3, random_state=42)

# Fit GMM with 3 components (matching the true number of clusters) and 'diag' covariance
gmm = GaussianMixture(n_components=3, covariance_type='diag', random_state=42)
gmm.fit(X_train)

# Predict cluster labels for train and validation
train_labels = gmm.predict(X_train)
val_labels = gmm.predict(X_val)

# Cluster ids are arbitrary, so map them to the true labels before scoring.
# The mapping is learned on the training split only and reused for validation,
# so the validation labels do not influence their own evaluation.
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def fit_label_mapping(true_labels, pred_labels):
    cm = confusion_matrix(true_labels, pred_labels)
    # Hungarian algorithm on the negated matrix maximizes the matched counts
    row_ind, col_ind = linear_sum_assignment(-cm)
    return {cluster: cls for cluster, cls in zip(col_ind, row_ind)}

mapping = fit_label_mapping(y_train, train_labels)
train_labels_mapped = np.array([mapping[label] for label in train_labels])
val_labels_mapped = np.array([mapping[label] for label in val_labels])

# Calculate accuracies
train_accuracy = accuracy_score(y_train, train_labels_mapped) * 100
val_accuracy = accuracy_score(y_val, val_labels_mapped) * 100

print(f"Training accuracy: {train_accuracy:.2f}%")
print(f"Validation accuracy: {val_accuracy:.2f}%")
Reduced the number of mixture components to 3, matching the true number of clusters.
Changed covariance type to 'diag' to simplify the model and reduce overfitting.
Used label mapping to correctly evaluate clustering accuracy.
Results Interpretation

Before: Training accuracy: 98%, Validation accuracy: 65%

After: Training accuracy: 93.5%, Validation accuracy: 87.2%

Reducing model complexity by limiting mixture components and choosing simpler covariance types helps reduce overfitting and improves validation accuracy in Gaussian Mixture Models.
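
The choice of component count and covariance type can also be made systematically rather than by hand. The following is a minimal sketch, assuming the same synthetic blobs as the solution; the search ranges (1 to 6 components, four covariance types) are illustrative assumptions, not part of the original exercise. It ranks candidates by BIC on the training split and reports the held-out log-likelihood of the winner.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

# Same synthetic data as the solution; labels are not needed for model selection
X, _ = make_blobs(n_samples=1000, centers=3, cluster_std=1.0, random_state=42)
X_train, X_val = train_test_split(X, test_size=0.3, random_state=42)

results = []
for cov_type in ["spherical", "diag", "tied", "full"]:
    for k in range(1, 7):  # illustrative search range
        gmm = GaussianMixture(n_components=k, covariance_type=cov_type, random_state=42)
        gmm.fit(X_train)
        # bic(): lower is better. score(): mean per-sample log-likelihood, higher is better.
        results.append((gmm.bic(X_train), k, cov_type, gmm.score(X_val)))

best_bic, best_k, best_cov, best_val_ll = min(results, key=lambda r: r[0])
print(f"Lowest BIC: n_components={best_k}, covariance_type={best_cov!r}")
print(f"Validation log-likelihood of that model: {best_val_ll:.3f}")
```

BIC penalizes the number of free parameters, so it tends to favor the simpler configurations that this experiment reaches by manual tuning.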
Bonus Experiment
Try using the Bayesian Gaussian Mixture Model (BayesianGaussianMixture) to automatically select the number of components.
💡 Hint
BayesianGaussianMixture treats n_components as an upper bound and can shrink the weights of unneeded components toward zero during fitting, which may improve generalization without manual tuning.
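
A possible starting point for the bonus experiment is sketched below. It assumes the same synthetic blobs as the solution; the upper bound of 10 components and the 0.01 weight cutoff used to count "active" components are illustrative choices, not values from the exercise.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

X, _ = make_blobs(n_samples=1000, centers=3, cluster_std=1.0, random_state=42)

# n_components is only an upper bound; unneeded components get small weights
bgmm = BayesianGaussianMixture(
    n_components=10,
    covariance_type="diag",
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=42,
)
bgmm.fit(X)

weight_threshold = 0.01  # illustrative cutoff for calling a component "active"
active = int(np.sum(bgmm.weights_ > weight_threshold))
print("Component weights:", np.round(bgmm.weights_, 3))
print("Components above the cutoff:", active)
```

Inspecting weights_ after fitting shows how many components the model actually uses, which can then be compared with the manually chosen value of 3.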

Practice

1. What is the main idea behind a Gaussian Mixture Model (GMM)?
easy
A. It assumes data is made of several bell-shaped groups mixed together.
B. It uses decision trees to split data into groups.
C. It finds the single best line to fit the data points.
D. It clusters data by measuring distances only.

Solution

  1. Step 1: Understand GMM concept

    GMM assumes data comes from multiple groups, each shaped like a bell curve (Gaussian).
  2. Step 2: Compare with other methods

    Unlike decision trees or distance-only methods, GMM models overlapping groups with probabilities.
  3. Final Answer:

    It assumes data is made of several bell-shaped groups mixed together. -> Option A
  4. Quick Check:

    GMM = mixture of Gaussians [OK]
Hint: Remember GMM = mix of bell curves for groups [OK]
Common Mistakes:
  • Confusing GMM with decision trees
  • Thinking GMM finds one line only
  • Assuming GMM uses only distances
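
The "mix of bell curves" idea can be checked directly. This is a small sketch on made-up 1D data (the means of -2 and 3 and the 30/70 split are assumptions chosen for illustration): samples are drawn from two Gaussians, and a fitted GMM should recover roughly those means and mixing weights.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
low = rng.normal(loc=-2.0, scale=1.0, size=300)   # 30% of the data
high = rng.normal(loc=3.0, scale=1.0, size=700)   # 70% of the data
X = np.concatenate([low, high]).reshape(-1, 1)    # scikit-learn expects 2D input

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Sort components by mean so the printout does not depend on label order
order = np.argsort(gmm.means_.ravel())
means = gmm.means_.ravel()[order]
weights = gmm.weights_[order]
print("Estimated means:", np.round(means, 2))
print("Estimated weights:", np.round(weights, 2))
```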
2. Which Python library provides a built-in Gaussian Mixture Model class?
easy
A. matplotlib
B. pandas
C. scikit-learn
D. tensorflow

Solution

  1. Step 1: Identify libraries for ML models

    scikit-learn is a popular library with many ML models including GMM.
  2. Step 2: Check other libraries' purpose

    matplotlib is for plotting, pandas for data handling, tensorflow for deep learning, not GMM specifically.
  3. Final Answer:

    scikit-learn -> Option C
  4. Quick Check:

    GMM in scikit-learn [OK]
Hint: GMM class is in scikit-learn, not plotting or deep learning libs [OK]
Common Mistakes:
  • Choosing matplotlib for modeling
  • Confusing pandas with ML models
  • Picking tensorflow for GMM
3. What will the following Python code output?
from sklearn.mixture import GaussianMixture
import numpy as np
X = np.array([[1], [2], [3], [10], [11], [12]])
gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(X)
labels = gmm.predict(X)
print(labels.tolist())
medium
A. [1, 0, 1, 0, 1, 0]
B. [0, 0, 0, 1, 1, 1]
C. [0, 1, 0, 1, 0, 1]
D. [1, 1, 1, 0, 0, 0]

Solution

  1. Step 1: Understand data and model

    Data has two clear groups: near 1-3 and near 10-12. GMM with 2 components fits these groups.
  2. Step 2: Predict labels

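    GMM assigns the first three points to one component and the last three to the other. Which component is numbered 0 is arbitrary in general and is fixed here only by random_state=0; the expected answer has the low group as label 0 and the high group as label 1.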
  3. Final Answer:

    [0, 0, 0, 1, 1, 1] -> Option B
  4. Quick Check:

    Groups split as low and high values [OK]
Hint: GMM labels cluster points close together [OK]
Common Mistakes:
  • Mixing label order (0 vs 1)
  • Assuming alternating labels
  • Ignoring clear group separation
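
Because component numbering is arbitrary, code that depends on a specific label order is fragile. One way to get a stable ordering, sketched here for the same six points, is to renumber components by their fitted mean; the variable names order, rank, and canonical are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1], [2], [3], [10], [11], [12]])
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)

# order[i] is the id of the component with the i-th smallest mean
order = np.argsort(gmm.means_.ravel())
# rank[c] is the position of component c when sorted by mean
rank = np.empty_like(order)
rank[order] = np.arange(len(order))

canonical = rank[labels]  # low-valued group becomes 0, high-valued group becomes 1
print(canonical.tolist())
```

After this renumbering the low group is always label 0, whichever ids the fit happened to assign.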
4. Identify the error in this GMM code snippet:
from sklearn.mixture import GaussianMixture
X = [[1, 2], [3, 4], [5, 6]]
gmm = GaussianMixture(n_components=2)
gmm.fit(X)
labels = gmm.predict(X)
print(labels)
medium
A. GaussianMixture requires a random_state parameter.
B. n_components must be 3 or more for this data.
C. fit() method should be called after predict().
D. X should be a NumPy array, not a list of lists.

Solution

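  1. Step 1: Check data format for GMM

    scikit-learn validates its input and converts array-like data, including a list of lists, to a NumPy array internally, so this snippet runs as written. Passing an explicit NumPy array is good practice rather than a requirement.
  2. Step 2: Verify other parameters and method order

    n_components=2 is valid for three samples, random_state is optional (omitting it only makes results non-reproducible), and fit() is correctly called before predict().
  3. Final Answer:

    X should be a NumPy array, not a list of lists. -> Option D (the closest option, as a recommendation; options A, B, and C are false)
  4. Quick Check:

    scikit-learn converts array-like input to an array [OK]
Hint: Prefer NumPy arrays for GMM input; lists of lists are converted automatically [OK]
Common Mistakes:
  • Assuming a list of lists raises an error
  • Wrong order of fit and predict
  • Thinking random_state is mandatory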
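
How scikit-learn treats list input can be verified directly. The sketch below uses a small made-up dataset (not the three points from the question) and fits the same model once on a list of lists and once on the equivalent NumPy array; scikit-learn's input validation converts array-like data, so both fits should give the same labels.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical data: two tight groups of three points each
X_list = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.1], [8.0, 9.0], [8.1, 9.2], [7.9, 8.8]]
X_array = np.array(X_list)

# Same settings and seed, so the only difference is the input container type
labels_from_list = GaussianMixture(n_components=2, random_state=0).fit(X_list).predict(X_list)
labels_from_array = GaussianMixture(n_components=2, random_state=0).fit(X_array).predict(X_array)

same = bool(np.array_equal(labels_from_list, labels_from_array))
print("List input accepted, labels identical:", same)
```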
5. You have a dataset with overlapping groups of different sizes and shapes. Which advantage of Gaussian Mixture Models makes them suitable here?
hard
A. They can model overlapping groups with different shapes using probabilities.
B. They always create groups of equal size.
C. They only work for groups that are perfectly separated.
D. They require groups to be circular and same size.

Solution

  1. Step 1: Understand group overlap and shape

    Real data groups often overlap and differ in shape and size.
  2. Step 2: Match GMM strengths

    GMM uses probabilities to model overlapping groups with different shapes, unlike simpler methods.
  3. Final Answer:

    They can model overlapping groups with different shapes using probabilities. -> Option A
  4. Quick Check:

    GMM handles overlap and shape variation [OK]
Hint: GMM models overlap and shape differences well [OK]
Common Mistakes:
  • Thinking GMM needs equal group sizes
  • Assuming groups must be separate
  • Believing GMM only fits circular groups
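
The probabilistic assignment described in this question is exposed through predict_proba. The following sketch uses made-up data (a round group and a smaller, stretched group that overlaps it; all parameters are illustrative assumptions) and prints the membership probabilities for three query points.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two overlapping groups with different sizes and shapes
round_group = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], size=600)
stretched_group = rng.multivariate_normal([2.5, 0.0], [[4.0, 1.5], [1.5, 1.0]], size=200)
X = np.vstack([round_group, stretched_group])

# 'full' covariance lets each component take its own elliptical shape
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

# One point inside the round group, one in the overlap, one far out on the stretched side
points = np.array([[0.0, 0.0], [1.5, 0.0], [6.0, 2.0]])
proba = gmm.predict_proba(points)  # each row sums to 1 across components
print(np.round(proba, 3))
```

Points in the overlap region receive a split probability instead of a hard label, which is the behavior option A describes.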