
How to Check Correlation in Python: Simple Guide

To check correlation in Python, use pandas.DataFrame.corr() for data tables or numpy.corrcoef() for arrays. By default, both compute the Pearson correlation coefficient, which measures how strongly two variables move together in a linear way.

Syntax

pandas: Use DataFrame.corr() to get the pairwise correlation matrix of the DataFrame's columns.

numpy: Use numpy.corrcoef(x, y) to get a 2x2 correlation matrix for two arrays; the element at [0, 1] is the coefficient between them.

python
import pandas as pd
import numpy as np

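# pandas syntax: df is an existing DataFrame
correlation_matrix = df.corr()

# numpy syntax: x and y are equal-length numeric sequences;
# corrcoef returns a 2x2 matrix, so [0, 1] selects the x-y coefficient
correlation_coefficient = np.corrcoef(x, y)[0, 1]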
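
To make the `[0, 1]` indexing concrete, here is a minimal sketch that inspects the matrix returned by `numpy.corrcoef`. The lists `x` and `y` hold made-up values chosen for illustration; they are not taken from the article.

```python
import numpy as np

# Illustrative values only; any two equal-length numeric sequences work
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

matrix = np.corrcoef(x, y)

# corrcoef returns a symmetric 2x2 matrix:
# the diagonal holds each variable's correlation with itself,
# the off-diagonal entries hold the x-y correlation
print(matrix.shape)                  # (2, 2)
r = matrix[0, 1]
print(np.isclose(r, matrix[1, 0]))   # True
```

Because the matrix is symmetric, `[0, 1]` and `[1, 0]` give the same coefficient.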

Example

This example shows how to calculate correlation between two variables using pandas and numpy.

python
import pandas as pd
import numpy as np

# Create sample data
data = {'height': [150, 160, 170, 180, 190],
        'weight': [50, 60, 65, 80, 90]}
df = pd.DataFrame(data)

# Calculate correlation matrix with pandas
corr_matrix = df.corr()

# Calculate correlation coefficient between height and weight with numpy
corr_coef = np.corrcoef(df['height'], df['weight'])[0, 1]

print('Correlation matrix with pandas:')
print(corr_matrix)
print('\nCorrelation coefficient with numpy:')
# Format to six decimals so the value lines up with the pandas display
print(f'{corr_coef:.6f}')
Output
Correlation matrix with pandas:
          height    weight
height  1.000000  0.990148
weight  0.990148  1.000000

Correlation coefficient with numpy:
0.990148
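
As a cross-check, the Pearson coefficient can be worked out directly from its definition: the sum of the products of each variable's deviations from its mean, divided by the square root of the product of the summed squared deviations. The sketch below does this for the same height and weight data; the helper name `pearson_r` is hypothetical and written only for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation from its definition (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()   # deviations from the mean
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

height = [150, 160, 170, 180, 190]
weight = [50, 60, 65, 80, 90]

manual = pearson_r(height, weight)
library = np.corrcoef(height, weight)[0, 1]

# The two values should agree up to floating-point rounding
print(np.isclose(manual, library))  # True
```

If the manual and library values ever disagree, the usual causes are mismatched array lengths or missing values in the input.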

Common Pitfalls

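  • Ignoring missing values: pandas corr() silently skips them pair by pair, so a coefficient may rest on far fewer rows than the table contains, while numpy.corrcoef() returns nan if either array contains one.
  • Running correlation on non-numeric data: text columns cannot be correlated, and in pandas 2.0 and later df.corr() raises an error unless you select the numeric columns first or pass numeric_only=True.
  • Confusing correlation with causation: the coefficient measures the strength and direction of a linear relationship, not cause and effect.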
python
import pandas as pd

# Risky: corr() silently skips rows with missing values (pairwise),
# so the result may rest on fewer rows than you expect
raw_df = pd.DataFrame({'a': [1, 2, 3, None, 5], 'b': [2, None, 6, 8, 10]})
print(raw_df.corr())

# Better: drop incomplete rows first and check how many remain
clean_df = raw_df.dropna()
print(len(clean_df), 'complete rows')
print(clean_df.corr())
Output
     a    b
a  1.0  1.0
b  1.0  1.0
3 complete rows
     a    b
a  1.0  1.0
b  1.0  1.0
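
The non-numeric pitfall, and the NumPy side of the missing-value pitfall, can be sketched in the same way. The table below is made up for illustration (the column names `city`, `temp`, and `sales` are assumptions, not from the article). It shows one possible way to restrict `corr()` to numeric columns with `select_dtypes`, and how `numpy.corrcoef` yields `nan` unless incomplete pairs are masked out first.

```python
import numpy as np
import pandas as pd

# Made-up data: one text column and one missing value
df = pd.DataFrame({
    'city': ['A', 'B', 'C', 'D'],
    'temp': [10.0, 15.0, 20.0, 25.0],
    'sales': [100.0, np.nan, 210.0, 240.0],
})

# Keep only numeric columns so the text column cannot break the call
numeric_corr = df.select_dtypes(include='number').corr()
print(list(numeric_corr.columns))  # ['temp', 'sales']

# numpy does not skip missing values: a single NaN makes the result nan
raw = np.corrcoef(df['temp'], df['sales'])[0, 1]
print(np.isnan(raw))  # True

# Mask out incomplete pairs before calling corrcoef
mask = df['temp'].notna() & df['sales'].notna()
masked = np.corrcoef(df['temp'][mask], df['sales'][mask])[0, 1]
print(np.isnan(masked))  # False
```

After masking, the NumPy value matches the pandas entry for the same pair of columns, since both are then computed from the same three complete rows.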

Quick Reference

Remember these tips when checking correlation in Python:

  • Use pandas.DataFrame.corr() for tables with multiple columns.
  • Use numpy.corrcoef() for two arrays or lists.
  • Handle missing data before calculating correlation.
  • Correlation values range from -1 (perfect negative) to 1 (perfect positive).
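
A small sketch with synthetic arrays (chosen here purely for illustration) shows both ends of that range. It also shows why a value near 0 only rules out a linear relationship: `x ** 2` on a symmetric range is fully determined by `x`, yet its Pearson coefficient with `x` is about 0.

```python
import numpy as np

x = np.arange(-5, 6)  # -5, -4, ..., 5

r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]   # perfectly increasing line
r_neg = np.corrcoef(x, -3 * x)[0, 1]      # perfectly decreasing line
r_sq = np.corrcoef(x, x ** 2)[0, 1]       # symmetric curve, no linear trend

print(round(r_pos, 6), round(r_neg, 6), round(abs(r_sq), 6))  # 1.0 -1.0 0.0
```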

Key Takeaways

  • Use pandas .corr() to get the correlation matrix of a DataFrame.
  • Use numpy.corrcoef() to find the correlation between two numeric arrays.
  • Always clean or otherwise handle missing data before calculating correlation.
  • Correlation shows the strength of a relationship, not cause and effect.
  • Correlation values range from -1 to 1, indicating direction and strength.