
Kolmogorov-Smirnov test in SciPy - Step-by-Step Execution

Concept Flow - Kolmogorov-Smirnov test
Start with two samples, or one sample and a reference distribution
Calculate empirical CDFs
Find max difference D between CDFs
Calculate p-value from D and sample sizes
Compare p-value to significance level
p > alpha: Fail to reject H0
p <= alpha: Reject H0
The test compares two distributions by measuring the largest difference between their cumulative distribution functions, then uses that difference to decide whether there is significant evidence that the distributions differ.
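The flow above also covers the one-sample case, where a sample is compared with a reference distribution instead of a second sample. A minimal sketch, assuming SciPy's `kstest` with the standard normal as the reference; the data, the seed, and the sample size are made up for illustration:

```python
import numpy as np
from scipy.stats import kstest

# Hypothetical data: 200 draws from a standard normal, seeded for repeatability.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=200)

# One-sample test: compare the sample's empirical CDF with the N(0, 1) CDF.
result = kstest(sample, "norm")
print(result.statistic, result.pvalue)
```

Here `"norm"` names the reference distribution in `scipy.stats`; with no extra arguments it uses mean 0 and standard deviation 1.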
Execution Sample
SciPy
from scipy.stats import ks_2samp

sample1 = [1, 2, 3, 4, 5]
sample2 = [2, 3, 4, 5, 6]

result = ks_2samp(sample1, sample2)
print(result)
This code compares two small samples to see if they come from the same distribution using the Kolmogorov-Smirnov test.
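Printing the result shows every field at once. When only D and the p-value are needed, they can be read individually, as in this short sketch (the variable names `d_stat` and `p_value` are illustrative):

```python
from scipy.stats import ks_2samp

sample1 = [1, 2, 3, 4, 5]
sample2 = [2, 3, 4, 5, 6]

result = ks_2samp(sample1, sample2)

# The result exposes the test statistic D and the p-value as attributes.
d_stat = result.statistic
p_value = result.pvalue
print(f"D = {d_stat:.3f}, p = {p_value:.3f}")

# It can also be unpacked like a 2-tuple of (statistic, pvalue).
d_unpacked, p_unpacked = ks_2samp(sample1, sample2)
```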
Execution Table
| Step | Action | Value/Calculation | Result |
|---|---|---|---|
| 1 | Input samples | sample1=[1,2,3,4,5], sample2=[2,3,4,5,6] | Samples ready |
| 2 | Calculate empirical CDFs | CDF1 and CDF2 evaluated at x = 1, 2, 3, 4, 5, 6 | CDF1=[0.2,0.4,0.6,0.8,1.0,1.0], CDF2=[0.0,0.2,0.4,0.6,0.8,1.0] |
| 3 | Find max difference D | max of abs(CDF1 - CDF2) | D=0.2 |
| 4 | Calculate p-value | p based on D and sample sizes | p=1.0 |
| 5 | Compare p-value to alpha=0.05 | p=1.0 > 0.05 | Fail to reject H0 |
| 6 | Conclusion | No significant evidence that the distributions differ | Test result: statistic=0.2, pvalue=1.0 |
💡 p-value is greater than 0.05, so we fail to reject the null hypothesis that samples come from the same distribution
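Steps 2 and 3 can be reproduced by hand. The sketch below is one way to do it with NumPy, not SciPy's internal code: it evaluates both empirical CDFs on the pooled set of observed values and takes the largest absolute gap. The names `grid`, `cdf1`, `cdf2`, and `d_manual` are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

sample1 = np.array([1, 2, 3, 4, 5])
sample2 = np.array([2, 3, 4, 5, 6])

# Evaluate both empirical CDFs on the pooled, sorted set of observed values.
grid = np.sort(np.concatenate([sample1, sample2]))

# ECDF(x) = fraction of sample values <= x; side="right" counts ties as <=.
cdf1 = np.searchsorted(np.sort(sample1), grid, side="right") / sample1.size
cdf2 = np.searchsorted(np.sort(sample2), grid, side="right") / sample2.size

# D is the largest absolute gap between the two empirical CDFs.
d_manual = np.max(np.abs(cdf1 - cdf2))
d_scipy = ks_2samp(sample1, sample2).statistic
print(d_manual, d_scipy)
```

Both CDFs must be evaluated at the same x values. Comparing each sample's CDF at its own points would make the two arrays look identical and hide the 0.2 gap.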
Variable Tracker
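| Variable | Start | After Step 2 | After Step 3 | After Step 4 | Final |
|---|---|---|---|---|---|
| sample1 | [1,2,3,4,5] | [1,2,3,4,5] | [1,2,3,4,5] | [1,2,3,4,5] | [1,2,3,4,5] |
| sample2 | [2,3,4,5,6] | [2,3,4,5,6] | [2,3,4,5,6] | [2,3,4,5,6] | [2,3,4,5,6] |
| CDF1 (at x = 1..6) | N/A | [0.2,0.4,0.6,0.8,1.0,1.0] | [0.2,0.4,0.6,0.8,1.0,1.0] | [0.2,0.4,0.6,0.8,1.0,1.0] | [0.2,0.4,0.6,0.8,1.0,1.0] |
| CDF2 (at x = 1..6) | N/A | [0.0,0.2,0.4,0.6,0.8,1.0] | [0.0,0.2,0.4,0.6,0.8,1.0] | [0.0,0.2,0.4,0.6,0.8,1.0] | [0.0,0.2,0.4,0.6,0.8,1.0] |
| D | N/A | N/A | 0.2 | 0.2 | 0.2 |
| p-value | N/A | N/A | N/A | 1.0 | 1.0 |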
Key Moments - 3 Insights
Why do we compare the p-value to 0.05?
0.05 is a common threshold (alpha) for significance; if p is greater, we have no significant evidence that the distributions differ (see execution_table step 5).
What does the D statistic represent?
D is the largest vertical gap between the two empirical cumulative distribution functions (execution_table step 3); it measures how far apart the sample distributions are.
Why do we fail to reject the null hypothesis when p is high?
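A high p means a difference at least as large as the observed one could easily arise by chance when H0 is true, so we keep the assumption that the samples come from the same distribution (step 5). This is not proof that the distributions are identical.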
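For reference, the D statistic discussed above can be written out explicitly. The notation is an assumption introduced here, not taken from the page: $F_{1,n}$ and $F_{2,m}$ are the empirical CDFs of the two samples, of sizes $n$ and $m$, and $X_1, \dots, X_n$ are the observations in the first sample.

```latex
D_{n,m} = \sup_{x} \left| F_{1,n}(x) - F_{2,m}(x) \right|,
\qquad
F_{1,n}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i \le x\}
```

In the example, at $x = 1$ the first empirical CDF is 0.2 and the second is 0, which already gives the maximum gap of 0.2.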
Visual Quiz - 3 Questions
Test your understanding
Look at the execution table at step 3, what is the value of the D statistic?
A. 1.0
B. 0.05
C. 0.2
D. 0.8
💡 Hint
Check the 'Find max difference D' row in the execution_table.
At which step do we decide if the samples come from the same distribution?
A. Step 2
B. Step 5
C. Step 3
D. Step 1
💡 Hint
Look for the step comparing p-value to alpha in the execution_table.
If the p-value was 0.01 instead of 1.0, what would change in the conclusion?
A. We would reject H0
B. D would be zero
C. We would fail to reject H0
D. Samples would be identical
💡 Hint
Recall that p <= 0.05 means rejecting the null hypothesis (see concept_flow).
Concept Snapshot
Kolmogorov-Smirnov test compares two samples by their cumulative distributions.
Calculate the max difference D between CDFs.
Compute p-value from D and sample sizes.
If p > 0.05, fail to reject H0: there is no significant evidence that the distributions differ.
If p <= 0.05, reject H0: the distributions differ significantly.
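The decision rule can be wrapped in a small function to see both outcomes side by side. This is a sketch under stated assumptions: `ks_decision` is a hypothetical helper, not a SciPy function, and the second pair of samples is made up to show a clear rejection.

```python
from scipy.stats import ks_2samp

ALPHA = 0.05  # significance level used throughout this walkthrough


def ks_decision(a, b, alpha=ALPHA):
    """Hypothetical helper: run ks_2samp and report the decision on H0."""
    result = ks_2samp(a, b)
    decision = "reject H0" if result.pvalue <= alpha else "fail to reject H0"
    return result.statistic, result.pvalue, decision


# Overlapping samples from the walkthrough.
overlapping = ks_decision([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])

# Illustrative, fully separated samples (made-up values, not from the source).
separated = ks_decision(list(range(10)), list(range(100, 110)))

print(overlapping)
print(separated)
```

With fully separated samples the empirical CDFs never overlap, so D reaches its maximum of 1 and the p-value falls well below 0.05.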
Full Transcript
The Kolmogorov-Smirnov test compares two samples, or a sample and a reference distribution, using their cumulative distribution functions (CDFs). It finds the largest difference D between these CDFs. Using D and the sample sizes, it calculates a p-value, which tells us whether the difference is statistically significant. If p is greater than 0.05, we fail to reject the null hypothesis that the samples come from the same distribution. If p is less than or equal to 0.05, we reject the null hypothesis, meaning the distributions differ significantly. The example code uses SciPy's ks_2samp function to perform this test on two small samples, giving a D of 0.2 and a p-value of 1.0, so the test finds no significant evidence that the samples come from different distributions.