How to Use a Gaussian Mixture Model in Python with sklearn
Use GaussianMixture from sklearn.mixture to fit a Gaussian Mixture Model: create an instance, call fit() on your data, then use predict() to get cluster labels or score_samples() for per-sample log-likelihoods. The model finds groups in data by assuming each group follows a Gaussian distribution.
Syntax
The main class to use is GaussianMixture from sklearn.mixture. You create a model by specifying the number of components (clusters) and other optional parameters. Then, you fit the model to your data using fit(). After fitting, use predict() to assign cluster labels or score_samples() to get the log-likelihood of each point.
- n_components: Number of Gaussian clusters to find.
- covariance_type: Shape of covariance matrices ('full', 'tied', 'diag', 'spherical').
- fit(X): Fits the model to data X.
- predict(X): Predicts cluster labels for X.
- score_samples(X): Returns log probabilities of each sample.
```python
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, covariance_type='full')
gmm.fit(X)  # X is your data array
labels = gmm.predict(X)
log_probs = gmm.score_samples(X)  # log-likelihood of each sample, not raw probabilities
```
Example
This example shows how to create synthetic data with 3 clusters, fit a Gaussian Mixture Model, and predict cluster labels.
```python
import numpy as np
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt

# Create synthetic data with 3 clusters
np.random.seed(0)
X1 = np.random.normal(0, 1, (100, 2))
X2 = np.random.normal(5, 1, (100, 2))
X3 = np.random.normal(-5, 1, (100, 2))
X = np.vstack([X1, X2, X3])

# Fit Gaussian Mixture Model
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0)
gmm.fit(X)
labels = gmm.predict(X)

# Plot results
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=30)
plt.title('Gaussian Mixture Model Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
```
Output
A scatter plot showing three distinct clusters colored differently based on GMM predicted labels.
Common Pitfalls
- Not scaling data can cause poor clustering: features on very different scales dominate the estimated covariances.
- Choosing the wrong n_components leads to underfitting or overfitting the clusters.
- Calling predict() before fit() raises a NotFittedError.
- Ignoring covariance_type can reduce model accuracy; 'full' is the most flexible but slowest.
```python
from sklearn.mixture import GaussianMixture
import numpy as np

X = np.random.rand(100, 2)
gmm = GaussianMixture(n_components=2)

# Wrong: predict before fit
try:
    gmm.predict(X)
except Exception as e:
    print(f'Error: {e}')

# Right: fit before predict
gmm.fit(X)
labels = gmm.predict(X)
print('Labels:', labels[:5])
```
Output
Error: This GaussianMixture instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
Labels: [1 0 1 1 1] (exact labels vary between runs, since no random seed is set)
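The first pitfall above concerns scaling. A minimal sketch of standardizing features before fitting, using made-up two-column data where the second feature is roughly 1000x larger than the first:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Features on very different scales: column 0 in [0, 1], column 1 in [0, 1000]
X = np.column_stack([rng.random(200), rng.random(200) * 1000])

# Standardize so each feature has mean 0 and unit variance
X_scaled = StandardScaler().fit_transform(X)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X_scaled)
labels = gmm.predict(X_scaled)
print(X_scaled.mean(axis=0))  # both features centered near zero
print(labels.shape)           # one label per sample
```

Without the scaling step, the large second feature would dominate the fitted covariances.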
Quick Reference
| Parameter / Method | Description |
|---|---|
| n_components | Number of Gaussian clusters to fit |
| covariance_type | Shape of covariance matrices: 'full', 'tied', 'diag', 'spherical' |
| fit(X) | Fit the model to data X |
| predict(X) | Predict cluster labels for data X |
| score_samples(X) | Log probability of each sample under the model |
| means_ | Array of cluster centers after fitting |
| covariances_ | Covariance matrices of clusters after fitting |
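As a quick sketch of the fitted attributes in the table, using synthetic one-dimensional data invented here for illustration (two well-separated blobs at 0 and 10):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated 1-D blobs around 0 and 10
X = np.concatenate([rng.normal(0, 1, 100), rng.normal(10, 1, 100)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.means_.shape)        # (2, 1): one center per component
print(gmm.covariances_.shape)  # (2, 1, 1) for the default covariance_type='full'

log_probs = gmm.score_samples(X)   # log-likelihoods, one per sample
densities = np.exp(log_probs)      # exponentiate to get density values
```

Note that score_samples returns log-likelihoods; exponentiate if you need density values.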
Key Takeaways
- Use sklearn's GaussianMixture to model data as a mixture of Gaussian clusters.
- Always fit the model with fit() before predicting cluster labels.
- Choose the number of components carefully to balance underfitting and overfitting.
- Scale your data first; features on very different scales distort the fitted Gaussians.
- Use the 'full' covariance type for flexibility unless speed is critical.
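One common way to choose n_components, as the takeaways suggest: fit a model per candidate count and compare BIC scores (lower is better). This sketch reuses the three-cluster data from the earlier example:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Same synthetic three-cluster data as the example above
np.random.seed(0)
X = np.vstack([np.random.normal(c, 1, (100, 2)) for c in (0, 5, -5)])

# Fit one model per candidate component count and record its BIC
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 7)}
best_k = min(bics, key=bics.get)
print(best_k)  # expected to be 3 for this well-separated data
```

BIC penalizes extra parameters, so it tends to reject models with more components than the data supports; AIC (via the aic() method) is a less conservative alternative.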