How to Check Correlation in Python: Simple Guide
To check correlation in Python, use pandas.DataFrame.corr() for data tables or numpy.corrcoef() for arrays. Both calculate the Pearson correlation coefficient by default, which measures how strongly two variables move together in a straight-line relationship.
Syntax
pandas: Use DataFrame.corr() to get the pairwise correlation matrix of a DataFrame's columns.
numpy: Use numpy.corrcoef(x, y) to get a 2x2 correlation matrix for two arrays; the element at [0, 1] is the coefficient between x and y.
python
import pandas as pd
import numpy as np

# pandas syntax (df is an existing DataFrame)
correlation_matrix = df.corr()

# numpy syntax (x and y are existing arrays of equal length)
correlation_coefficient = np.corrcoef(x, y)[0, 1]
Example
This example shows how to calculate correlation between two variables using pandas and numpy.
python
import pandas as pd
import numpy as np

# Create sample data
data = {'height': [150, 160, 170, 180, 190], 'weight': [50, 60, 65, 80, 90]}
df = pd.DataFrame(data)

# Calculate correlation matrix with pandas
corr_matrix = df.corr()

# Calculate correlation coefficient between height and weight with numpy
corr_coef = np.corrcoef(df['height'], df['weight'])[0, 1]

print('Correlation matrix with pandas:')
print(corr_matrix)
print('\nCorrelation coefficient with numpy:')
print(corr_coef)
Output
Correlation matrix with pandas:
          height    weight
height  1.000000  0.990148
weight  0.990148  1.000000

Correlation coefficient with numpy:
0.99014754... (remaining digits omitted)
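To make concrete what both functions compute, here is a minimal sketch that derives the Pearson coefficient by hand from the same height and weight values and compares it with numpy.corrcoef(). The helper name pearson_r is hypothetical and exists only for this illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Hypothetical helper: Pearson r from deviations around the means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # how far each x value lies from the mean of x
    dy = y - y.mean()  # how far each y value lies from the mean of y
    # Sum of paired deviations, scaled by the spread of both variables
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

height = [150, 160, 170, 180, 190]
weight = [50, 60, 65, 80, 90]

manual = pearson_r(height, weight)
library = np.corrcoef(height, weight)[0, 1]

print(round(manual, 6))             # 0.990148
print(np.isclose(manual, library))  # True
```

The two values agree because numpy.corrcoef() applies the same formula; the manual version is only useful for seeing where the number comes from.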
Common Pitfalls
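- Ignoring missing values: DataFrame.corr() silently skips them pair by pair, so a result may rest on very few rows, while numpy.corrcoef() returns nan if either array contains a missing value.
- Using correlation on non-numeric data will fail or give meaningless results, so select the numeric columns first.
- Confusing correlation with causation; the coefficient shows only the strength and direction of a linear relationship.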
python
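import pandas as pd

# Two columns with missing values in different rows
raw_df = pd.DataFrame({'a': [1, 2, None], 'b': [4, None, 6]})

# Pitfall: corr() skips missing values pair by pair without warning.
# Only one complete (a, b) pair remains, so their correlation is NaN.
print(raw_df.corr())

# Better: drop incomplete rows explicitly and check how many are left
clean_df = raw_df.dropna()
print('Rows left after dropna():', len(clean_df))

# A single row is not enough to measure correlation, so every entry is NaN
print(clean_df.corr())
Output
     a    b
a  1.0  NaN
b  NaN  1.0
Rows left after dropna(): 1
    a   b
a NaN NaN
b NaN NaN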
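The non-numeric and numpy sides of these pitfalls can be sketched in the same way. The example below is a minimal illustration on made-up data (the name, height, and weight values are assumptions, not from a real dataset): it keeps only numeric columns with select_dtypes() and masks out incomplete pairs before calling numpy.corrcoef().

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one text column and one missing height
df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cy', 'Dee'],
    'height': [150.0, 160.0, np.nan, 180.0],
    'weight': [50.0, 60.0, 65.0, 80.0],
})

# Keep only numeric columns so the text column cannot break the calculation
numeric_df = df.select_dtypes(include='number')
print(list(numeric_df.columns))

# numpy.corrcoef() does not skip missing values: one NaN makes the result NaN
raw = np.corrcoef(numeric_df['height'], numeric_df['weight'])[0, 1]
print(np.isnan(raw))

# Keep only rows where both values are present, then call numpy
mask = numeric_df['height'].notna() & numeric_df['weight'].notna()
clean = np.corrcoef(numeric_df.loc[mask, 'height'],
                    numeric_df.loc[mask, 'weight'])[0, 1]
print(round(clean, 4))
```

Masking both columns with the same condition keeps the pairs aligned, which matters because numpy.corrcoef() matches values by position.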
Quick Reference
Remember these tips when checking correlation in Python:
- Use pandas.DataFrame.corr() for tables with multiple columns.
- Use numpy.corrcoef() for two arrays or lists.
- Handle missing data before calculating correlation.
- Correlation values range from -1 (perfect negative) to 1 (perfect positive).
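As a quick illustration of that range, the sketch below builds three small made-up series: one that rises exactly with x, one that falls exactly as x rises, and one with no linear trend. The values are chosen purely for demonstration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# y rises in a perfect straight line with x: coefficient close to 1
r_positive = np.corrcoef(x, 2 * x + 1)[0, 1]

# y falls in a perfect straight line as x rises: coefficient close to -1
r_negative = np.corrcoef(x, -3 * x + 10)[0, 1]

# Symmetric values with no overall upward or downward trend: coefficient close to 0
r_none = np.corrcoef(x, [1, -1, 0, -1, 1])[0, 1]

print(round(r_positive, 6), round(r_negative, 6), round(r_none, 6))
```

Note that the third series still follows a pattern; a coefficient near 0 only rules out a linear relationship.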
Key Takeaways
Use pandas DataFrame.corr() to get the correlation matrix of a DataFrame.
Use numpy.corrcoef() to find correlation between two numeric arrays.
Always clean or handle missing data before calculating correlation.
Correlation shows relationship strength, not cause and effect.
Correlation values range from -1 to 1, indicating direction and strength.