
Chi-squared test in Data Analysis Python - Step-by-Step Execution


Concept Flow - Chi-squared test

Start with observed data

↓

Calculate expected data

↓

Compute (Observed - Expected)^2 / Expected

↓

Sum all values to get Chi-squared statistic

↓

Compare statistic to Chi-squared distribution

↓

Decide if variables are independent or not

The test compares the observed counts with the counts expected if the two categorical variables were independent. Large differences between the two are evidence against independence.
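The flow above can be written compactly as formulas. The notation here is an assumption introduced for illustration and does not appear in the original: O_ij is the observed count in row i and column j, R_i and C_j are the row and column sums, N is the grand total, and r and c are the numbers of rows and columns.

```latex
\[
E_{ij} = \frac{R_i \, C_j}{N}, \qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
\mathrm{dof} = (r - 1)(c - 1)
\]
```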

Execution Sample


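import scipy.stats as stats

# 2x2 contingency table of observed counts
observed = [[10, 20], [20, 40]]
# Returns the test statistic, p-value, degrees of freedom, and expected counts
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(chi2, p, dof)
print(expected)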

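This code runs a chi-squared test of independence on a 2x2 table and prints the test statistic, p-value, degrees of freedom, and expected counts. For 2x2 tables, chi2_contingency applies Yates' continuity correction by default (correction=True). It has no effect here because the observed and expected counts are identical.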

Execution Table

Step	Action	Calculation	Result
1	Input observed data	Observed = [[10, 20], [20, 40]]	Observed data set
2	Calculate row sums	Row sums = [30, 60]	30, 60
3	Calculate column sums	Column sums = [30, 60]	30, 60
4	Calculate total sum	Total = 90	90
5	Calculate expected counts	Expected[i,j] = (row_sum[i] * col_sum[j]) / total	[[10.0, 20.0], [20.0, 40.0]]
6	Calculate (O - E)^2 / E for each cell	For cell (0,0): (10-10)^2/10=0 For cell (0,1): (20-20)^2/20=0 For cell (1,0): (20-20)^2/20=0 For cell (1,1): (40-40)^2/40=0	[0,0,0,0]
7	Sum all values	Sum = 0 + 0 + 0 + 0	Chi-squared statistic = 0
8	Calculate p-value and degrees of freedom	dof = (rows-1)*(cols-1) = 1	p-value = 1.0, dof = 1
9	Interpret result	p-value = 1.0 > 0.05, so there is no evidence to reject independence	Fail to reject independence (data are consistent with independent variables)

💡 Test ends after calculating p-value and interpreting result
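Steps 2 to 7 of the table can be reproduced by hand with NumPy. This is a minimal sketch rather than SciPy's internal implementation. The variable names follow the Variable Tracker, and using np.outer for the expected counts is one convenient choice among several.

```python
import numpy as np

# Observed 2x2 table from the execution sample
observed = np.array([[10, 20], [20, 40]], dtype=float)

row_sums = observed.sum(axis=1)   # step 2: sum across each row
col_sums = observed.sum(axis=0)   # step 3: sum down each column
total = observed.sum()            # step 4: grand total

# Step 5: expected[i, j] = row_sums[i] * col_sums[j] / total
expected = np.outer(row_sums, col_sums) / total

# Step 6: per-cell contribution (O - E)^2 / E
chi_squared_components = (observed - expected) ** 2 / expected

# Step 7: the statistic is the sum of all contributions
chi_squared_statistic = chi_squared_components.sum()

# Degrees of freedom: (rows - 1) * (cols - 1)
degrees_of_freedom = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(expected)
print(chi_squared_components)
print(chi_squared_statistic, degrees_of_freedom)
```

Because every row of this table has the same 1:2 proportions, the expected counts equal the observed counts and every contribution is zero.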

Variable Tracker

Variable	Start	After Step 2	After Step 3	After Step 4	After Step 5	After Step 6	Final
observed	[[10,20],[20,40]]	[[10,20],[20,40]]	[[10,20],[20,40]]	[[10,20],[20,40]]	[[10,20],[20,40]]	[[10,20],[20,40]]	[[10,20],[20,40]]
row_sums	N/A	[30,60]	[30,60]	[30,60]	[30,60]	[30,60]	[30,60]
col_sums	N/A	N/A	[30,60]	[30,60]	[30,60]	[30,60]	[30,60]
total	N/A	N/A	N/A	90	90	90	90
expected	N/A	N/A	N/A	N/A	[[10.0,20.0],[20.0,40.0]]	[[10.0,20.0],[20.0,40.0]]	[[10.0,20.0],[20.0,40.0]]
chi_squared_components	N/A	N/A	N/A	N/A	N/A	[0,0,0,0]	[0,0,0,0]
chi_squared_statistic	N/A	N/A	N/A	N/A	N/A	N/A	0
p_value	N/A	N/A	N/A	N/A	N/A	N/A	1.0
degrees_of_freedom	N/A	N/A	N/A	N/A	N/A	N/A	1

Key Moments - 3 Insights

Why do we calculate expected counts using row and column sums?

Why can the chi-squared statistic be zero?

What does a high p-value mean in this test?
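These questions can be explored numerically by comparing the example table with one whose rows are not proportional. The sketch below is illustrative only: the table named dependent is made-up data that does not come from the example, and correction=False is passed so that SciPy's result matches the uncorrected formula used in the execution table.

```python
import scipy.stats as stats

# Rows are proportional (1:2 in both), so observed equals expected everywhere.
proportional = [[10, 20], [20, 40]]
# Hypothetical counts whose rows are NOT proportional (made-up data).
dependent = [[30, 10], [10, 30]]

# correction=False disables Yates' continuity correction for 2x2 tables.
chi2_ind, p_ind, _, _ = stats.chi2_contingency(proportional, correction=False)
chi2_dep, p_dep, _, expected_dep = stats.chi2_contingency(dependent, correction=False)

print(chi2_ind, p_ind)
print(chi2_dep, p_dep)
print(expected_dep)
```

For the proportional table the statistic is 0 and the p-value is 1.0. For the hypothetical table every expected count is 20, each cell contributes (±10)^2 / 20 = 5, and the statistic is 20, which gives a p-value far below 0.05.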

Visual Quiz - 3 Questions

Test your understanding

Look at step 5 of the Execution Table. What is the expected count for cell (1,1), using zero-based indices (second row, second column)?

A. 40

B. 20

C. 30

D. 60

Concept Snapshot

Chi-squared test checks if two categorical variables are independent.
Input: observed counts in a contingency table.
Calculate expected counts assuming independence.
Compute sum of (Observed - Expected)^2 / Expected.
Compare statistic to chi-squared distribution for p-value.
Low p-value (<0.05) is evidence against independence; a high p-value means no evidence against it, not proof of independence.
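The last two points of the snapshot can be sketched with scipy.stats.chi2, whose survival function sf gives the upper-tail probability. The values of chi2_stat and dof are taken from the worked example, and alpha = 0.05 is the conventional threshold mentioned above, not a requirement of the test.

```python
from scipy.stats import chi2

chi2_stat = 0.0   # statistic from the worked example
dof = 1           # (rows - 1) * (cols - 1) for a 2x2 table
alpha = 0.05      # conventional significance level (an assumption)

# p-value: probability of a statistic at least this large under independence
p_value = chi2.sf(chi2_stat, dof)

# Equivalent view: the statistic must exceed this value to reject at level alpha
critical_value = chi2.ppf(1 - alpha, dof)

reject_independence = p_value < alpha
print(p_value, critical_value, reject_independence)
```

With one degree of freedom the critical value is about 3.84, so a statistic of 0 falls well short of it and independence is not rejected.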

Full Transcript

The chi-squared test compares observed counts with the counts expected if the variables were independent. We start with the observed data, then calculate the row sums, the column sums, and the grand total. The expected count for each cell is its row sum multiplied by its column sum, divided by the total. For each cell, we compute (Observed - Expected)^2 / Expected, then sum these values to get the chi-squared statistic. We find the p-value from the chi-squared distribution with (rows - 1) * (cols - 1) degrees of freedom. A high p-value means there is no evidence to reject independence; a low p-value suggests the variables are related. In this example the statistic is 0 and the p-value is 1.0, so independence is not rejected.