How to Use a Gaussian Mixture Model in Python with sklearn
Use GaussianMixture from sklearn.mixture to fit a Gaussian Mixture Model: create an instance, call fit() on your data, then use predict() to get cluster labels or score_samples() for per-sample log-likelihoods. The model finds groups in data by assuming each group follows a Gaussian distribution.
Syntax
The main class to use is GaussianMixture from sklearn.mixture. You create a model by specifying the number of components (clusters) and other optional parameters. Then, you fit the model to your data using fit(). After fitting, use predict() to assign cluster labels or score_samples() to get the log-likelihood of each point.
- n_components: Number of Gaussian clusters to find.
- covariance_type: Shape of covariance matrices ('full', 'tied', 'diag', 'spherical').
- fit(X): Fits the model to data X.
- predict(X): Predicts cluster labels for X.
- score_samples(X): Returns log probabilities of each sample.
```python
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, covariance_type='full')
gmm.fit(X)  # X is your data array
labels = gmm.predict(X)
log_probs = gmm.score_samples(X)  # log-likelihood of each sample, not raw probabilities
```
Example
This example shows how to create synthetic data with 3 clusters, fit a Gaussian Mixture Model, and predict cluster labels.
```python
import numpy as np
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt

# Create synthetic data with 3 clusters
np.random.seed(0)
X1 = np.random.normal(0, 1, (100, 2))
X2 = np.random.normal(5, 1, (100, 2))
X3 = np.random.normal(-5, 1, (100, 2))
X = np.vstack([X1, X2, X3])

# Fit Gaussian Mixture Model
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0)
gmm.fit(X)
labels = gmm.predict(X)

# Plot results
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=30)
plt.title('Gaussian Mixture Model Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
```
Output
A scatter plot showing three distinct clusters colored differently based on GMM predicted labels.
Common Pitfalls
- Not scaling data can cause poor clustering: features on very different scales dominate the estimated covariances.
- Choosing the wrong n_components leads to underfitting or overfitting the clusters.
- Calling predict() before fit() raises a NotFittedError.
- Ignoring covariance_type can reduce model accuracy; 'full' is the most flexible but slowest.
```python
from sklearn.mixture import GaussianMixture
import numpy as np

X = np.random.rand(100, 2)
gmm = GaussianMixture(n_components=2)

# Wrong: predict before fit
try:
    gmm.predict(X)
except Exception as e:
    print(f'Error: {e}')

# Right: fit before predict
gmm.fit(X)
labels = gmm.predict(X)
print('Labels:', labels[:5])
```
Output
Error: This GaussianMixture instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
Labels: [1 0 1 1 1] (exact labels vary between runs, since no random seed is set)
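The first pitfall above concerns scaling. A minimal sketch of standardizing features before fitting, using made-up two-column data where the second feature is roughly 1000x larger than the first:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Features on very different scales: column 0 in [0, 1], column 1 in [0, 1000]
X = np.column_stack([rng.random(200), rng.random(200) * 1000])

# Standardize so each feature has mean 0 and unit variance
X_scaled = StandardScaler().fit_transform(X)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X_scaled)
labels = gmm.predict(X_scaled)
print(X_scaled.mean(axis=0))  # both features centered near zero
print(labels.shape)           # one label per sample
```

Without the scaling step, the large second feature would dominate the fitted covariances.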
Quick Reference
| Parameter / Method | Description |
|---|---|
| n_components | Number of Gaussian clusters to fit |
| covariance_type | Shape of covariance matrices: 'full', 'tied', 'diag', 'spherical' |
| fit(X) | Fit the model to data X |
| predict(X) | Predict cluster labels for data X |
| score_samples(X) | Log probability of each sample under the model |
| means_ | Array of cluster centers after fitting |
| covariances_ | Covariance matrices of clusters after fitting |
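As a quick sketch of the fitted attributes in the table, using synthetic one-dimensional data invented here for illustration (two well-separated blobs at 0 and 10):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated 1-D blobs around 0 and 10
X = np.concatenate([rng.normal(0, 1, 100), rng.normal(10, 1, 100)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.means_.shape)        # (2, 1): one center per component
print(gmm.covariances_.shape)  # (2, 1, 1) for the default covariance_type='full'

log_probs = gmm.score_samples(X)   # log-likelihoods, one per sample
densities = np.exp(log_probs)      # exponentiate to get density values
```

Note that score_samples returns log-likelihoods; exponentiate if you need density values.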
Key Takeaways
- Use sklearn's GaussianMixture to model data as a mixture of Gaussian clusters.
- Always fit the model with fit() before predicting cluster labels.
- Choose the number of components carefully to balance underfitting and overfitting.
- Scale your data first; features on very different scales distort the fitted Gaussians.
- Use the 'full' covariance type for flexibility unless speed is critical.
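One common way to choose n_components, as the takeaways suggest: fit a model per candidate count and compare BIC scores (lower is better). This sketch reuses the three-cluster data from the earlier example:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Same synthetic three-cluster data as the example above
np.random.seed(0)
X = np.vstack([np.random.normal(c, 1, (100, 2)) for c in (0, 5, -5)])

# Fit one model per candidate component count and record its BIC
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 7)}
best_k = min(bics, key=bics.get)
print(best_k)  # expected to be 3 for this well-separated data
```

BIC penalizes extra parameters, so it tends to reject models with more components than the data supports; AIC (via the aic() method) is a less conservative alternative.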