How to Use a Correlation Matrix for Features in Python
Use pandas.DataFrame.corr() to compute the correlation matrix of features in Python. This matrix shows how strongly each feature relates to the others, helping you identify redundant or important features for machine learning.
Syntax
The correlation matrix is computed using DataFrame.corr() in pandas. It returns a table showing correlation coefficients between pairs of features.
- df.corr(): Computes pairwise correlation of columns.
- Returns a DataFrame where rows and columns are features.
- Values range from -1 (perfect negative) to 1 (perfect positive).
python
correlation_matrix = df.corr()
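Beyond the default call, corr() accepts a method argument ('pearson', 'spearman', or 'kendall') and a numeric_only flag. The sketch below uses a small made-up DataFrame (the column names x, y, and label are illustrative assumptions, not part of the example dataset) to show how the rank-based Spearman method picks up a monotonic but non-linear relationship that Pearson understates.

```python
import pandas as pd

# Hypothetical toy data: y grows monotonically but non-linearly with x
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 8, 27, 64, 125],            # y = x ** 3
    "label": ["a", "b", "c", "d", "e"],  # non-numeric column
})

# numeric_only=True leaves the non-numeric "label" column out of the matrix
pearson = df.corr(method="pearson", numeric_only=True)
spearman = df.corr(method="spearman", numeric_only=True)

print(pearson.loc["x", "y"])   # below 1: the relationship is not linear
print(spearman.loc["x", "y"])  # close to 1: the ranks agree exactly
```

Pearson remains a reasonable default for roughly linear relationships; Spearman may be worth trying when features are skewed or related through a curve.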
Example
This example shows how to load a dataset, compute the correlation matrix, and print it. It demonstrates how to identify feature relationships.
python
import pandas as pd
from sklearn.datasets import load_iris

# Load iris dataset as a pandas DataFrame
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Compute correlation matrix
correlation_matrix = df.corr()

# Print correlation matrix
print(correlation_matrix)
Output
                   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
sepal length (cm)           1.000000         -0.117570           0.871754          0.817941
sepal width (cm)           -0.117570          1.000000          -0.428440         -0.366126
petal length (cm)           0.871754         -0.428440           1.000000          0.962865
petal width (cm)            0.817941         -0.366126           0.962865          1.000000
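Reading a matrix like this by eye gets harder as the number of features grows. One possible way to list strongly related pairs is sketched below. The helper name strongly_correlated_pairs, the 0.9 threshold, and the toy columns a, b, and c are all assumptions made for illustration. Applied to the iris matrix above, a 0.9 threshold would flag petal length and petal width (0.962865).

```python
import numpy as np
import pandas as pd

def strongly_correlated_pairs(df, threshold=0.9):
    """Return (feature_a, feature_b, r) for pairs with |r| >= threshold.

    Hypothetical helper for illustration; the threshold is a judgment call.
    """
    corr = df.corr(numeric_only=True)
    # Keep only the upper triangle (k=1 excludes the diagonal)
    # so each pair appears once and self-correlations are ignored
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries compare as False here, so masked cells are filtered out too
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, round(float(r), 3)) for (a, b), r in pairs.items()]

# Made-up data: b is roughly 2 * a, c is only loosely related to either
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})
result = strongly_correlated_pairs(toy, threshold=0.9)
print(result)  # only the (a, b) pair passes the threshold
```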
Common Pitfalls
Common mistakes when using correlation matrices include:
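- Confusing correlation with causation: correlation shows that features move together, not that one causes the other.
- Ignoring the type of features: the default Pearson coefficient measures linear association and works best with continuous numeric data. Rescaling a feature does not change it.
- Using correlation on categorical features without encoding: depending on the pandas version, this raises an error or silently leaves the columns out.
- Not handling missing values before computing correlation: corr() skips missing values pair by pair, so coefficients can be based on different (and sometimes very few) rows.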
python
import pandas as pd

# Wrong: correlation on categorical data without encoding
cat_df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})
try:
    print(cat_df.corr())
except Exception as e:
    print(f'Error: {e}')

# Right: encode categorical data before correlation
cat_df_encoded = pd.get_dummies(cat_df)
print(cat_df_encoded.corr())
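Output
With pandas 2.x, the first call raises an error because the strings cannot be converted to numbers. Versions before 2.0 leave out the non-numeric column and print an empty DataFrame instead, and the exact error text may vary by version. The dummy columns are negatively correlated with each other because only one of them can be set in each row.
Error: could not convert string to float: 'red'
             color_blue  color_green  color_red
color_blue     1.000000    -0.577350  -0.577350
color_green   -0.577350     1.000000  -0.333333
color_red     -0.577350    -0.333333   1.000000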
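The missing-value pitfall can be made concrete with a small sketch. corr() excludes missing values pair by pair, so a coefficient may rest on very few rows. The data below is made up for illustration, and the min_periods value of 3 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps; column c has only two observed values
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.0, 4.0, np.nan, 8.0, 10.0, 12.0],
    "c": [6.0, np.nan, np.nan, np.nan, np.nan, 1.0],
})

# Count the rows where both columns of each pair are present
present = df.notna().astype(int)
pair_counts = present.T @ present
print(pair_counts)

# Default: a and c share only two rows, which always yields +1 or -1
default_corr = df.corr()
print(default_corr)

# min_periods=3: pairs with fewer than three shared rows become NaN
guarded_corr = df.corr(min_periods=3)
print(guarded_corr)
```

Checking the pair counts first is one way to decide whether to drop rows, impute values, or set min_periods before trusting any single coefficient.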
Quick Reference
- Use: df.corr() to get the correlation matrix.
- Interpret: Values near 1 or -1 show strong positive or negative correlation.
- Use for: Feature selection, detecting multicollinearity.
- Handle: Encode categorical features and clean missing data before correlation.
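For feature selection, one common use is to rank features by how strongly they correlate with the prediction target. The sketch below assumes a DataFrame with a numeric column named target. That column name, the feature names x1 and x2, and the data are all hypothetical. It uses DataFrame.corrwith() to correlate each feature with the target.

```python
import pandas as pd

# Made-up data: x1 tracks the target closely, x2 just alternates
df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 1, 2, 1, 2, 1],
    "target": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0],
})

# Correlate every feature column with the target column
target_corr = df.drop(columns="target").corrwith(df["target"])

# Rank by absolute value: strong negative correlation is as useful as positive
ranking = target_corr.abs().sort_values(ascending=False)
print(ranking)
```

A ranking like this only reflects linear, one-feature-at-a-time relationships, so it is best treated as a first screen rather than a final selection.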
Key Takeaways
Use pandas DataFrame.corr() to compute a feature correlation matrix.
Correlation values range from -1 to 1, indicating the strength and direction of each relationship.
Always encode categorical features before computing correlation to avoid errors.
A correlation matrix helps identify redundant features that can be dropped before modeling.
Handle missing data before correlation to ensure accurate results.