
Kolmogorov-Smirnov test in SciPy - Deep Dive

Overview - Kolmogorov-Smirnov test
What is it?
The Kolmogorov-Smirnov (KS) test is a way to check whether two sets of numbers could have come from the same distribution. It compares the cumulative shapes of the data to see if they match closely or differ. Because the test does not assume any specific type of distribution, it is very flexible. It helps decide whether two samples are similar, or whether one sample fits a known distribution.
Why it matters
Without this test, we might wrongly assume two data sets are alike or that data fits a certain pattern, leading to bad decisions. For example, in medicine, it helps check if a new treatment's results differ from usual outcomes. Without it, we could miss important differences or similarities, affecting research and real-world choices.
Where it fits
Before learning this, you should understand basic statistics like distributions and hypothesis testing. After this, you can explore other goodness-of-fit tests or advanced statistical comparisons. It fits in the journey after learning about data distributions and before deep statistical modeling.
Mental Model
Core Idea
The Kolmogorov-Smirnov test measures the biggest gap between two data patterns to decide if they come from the same source.
Think of it like...
Imagine two runners on a track starting together but running at different speeds. The test looks at the biggest distance between them during the race to see if they are running similarly or not.
1 ─┤                   ┌───────── Sample 1 CDF
   │             ┌─────┘
   │             │ ↕ D
   │        ┌────┘     ┌───────── Sample 2 CDF
   │        │     ┌────┘
   │   ┌────┘     │
0 ─┤───┘──────────┘

D = max vertical distance between these two curves
Build-Up - 7 Steps
1
Foundation: Understanding Data Distributions
🤔
Concept: Learn what a data distribution is and how data points spread in a sample.
A data distribution shows how often different values appear in a dataset. For example, test scores might cluster around a middle value or spread out evenly. Visual tools like histograms or cumulative distribution functions (CDFs) help us see this spread.
Result
You can describe how data values are arranged and recognize patterns like clustering or spread.
Understanding distributions is key because the Kolmogorov-Smirnov test compares these patterns between datasets.
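As a quick illustration of "seeing the spread" (the sample data, seed, and bin count below are arbitrary choices of mine, not from the text), a histogram counts how many values fall into each interval:

```python
import numpy as np

# Illustrative sample: 1000 draws from a bell-shaped distribution
rng = np.random.default_rng(42)
scores = rng.normal(loc=70, scale=10, size=1000)

# Count how many values fall into each of 10 equal-width bins
counts, bin_edges = np.histogram(scores, bins=10)
for count, left, right in zip(counts, bin_edges[:-1], bin_edges[1:]):
    print(f"{left:6.1f} - {right:6.1f}: {'#' * (count // 20)}")
```

Most bars cluster around 70, showing the central tendency and spread described above.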
2
Foundation: What is a Cumulative Distribution Function?
🤔
Concept: Introduce the CDF as a way to summarize data distribution by showing cumulative probabilities.
A CDF tells us the chance that a data point is less than or equal to a certain value. It starts at 0 and rises to 1 as we move through the data range. Plotting the CDF gives a smooth curve representing the data's distribution.
Result
You can create and interpret CDFs to understand data spread in a cumulative way.
The Kolmogorov-Smirnov test uses CDFs to compare datasets, so knowing how to read them is essential.
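A minimal sketch of how an empirical CDF can be built by hand (the function and variable names here are my own, not a SciPy API):

```python
import numpy as np

def empirical_cdf(sample, x):
    """Fraction of sample values less than or equal to x."""
    sample = np.sort(np.asarray(sample))
    # searchsorted(..., side="right") counts values <= x in the sorted sample
    return np.searchsorted(sample, x, side="right") / len(sample)

data = [3, 1, 4, 1, 5, 9, 2, 6]
print(empirical_cdf(data, 1))   # 2 of 8 values are <= 1  -> 0.25
print(empirical_cdf(data, 4))   # 5 of 8 values are <= 4  -> 0.625
print(empirical_cdf(data, 10))  # all values are <= 10    -> 1.0
```

Note how the function starts near 0 for small x and climbs to 1, exactly as described above.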
3
Intermediate: Comparing Two Samples with the KS Test
🤔 Before reading on: do you think the KS test compares averages or the entire data shape? Commit to your answer.
Concept: The KS test compares the full shape of two data distributions, not just averages or medians.
The test calculates the maximum vertical distance (D) between the two samples' CDFs. It then uses this distance to decide if the samples likely come from the same distribution. A small D means they are similar; a large D means they differ.
Result
You get a test statistic (D) and a p-value indicating similarity between samples.
Knowing the test compares entire distributions helps avoid mistakes like focusing only on averages.
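A minimal sketch using SciPy's `ks_2samp` (the sample sizes, seed, and the 1-unit shift are my own illustrative choices):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
same_a = rng.normal(0, 1, 500)
same_b = rng.normal(0, 1, 500)   # same distribution, different draws
shifted = rng.normal(1, 1, 500)  # same shape, shifted by 1

res_same = ks_2samp(same_a, same_b)
res_diff = ks_2samp(same_a, shifted)

# A small D suggests similar distributions; a large D suggests they differ
print(f"same source:    D = {res_same.statistic:.3f}, p = {res_same.pvalue:.3f}")
print(f"shifted source: D = {res_diff.statistic:.3f}, p = {res_diff.pvalue:.2e}")
```

The shifted pair yields a much larger D and a far smaller p-value, even though the two distributions have the same shape and spread.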
4
Intermediate: Using the KS Test for One Sample vs. a Known Distribution
🤔 Before reading on: can the KS test check if one sample fits a known distribution? Commit to yes or no.
Concept: The KS test can check if a sample matches a specific known distribution, like normal or uniform.
You compare the sample's CDF to the theoretical CDF of the known distribution. The test measures the biggest gap and computes a p-value to judge whether the fit should be rejected.
Result
You learn if your data likely follows the chosen theoretical distribution.
This use helps validate assumptions about data, which is crucial before applying many statistical methods.
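The one-sample form is SciPy's `kstest`. Below is a sketch (seed and sample sizes are my own choices); note that if you estimate the distribution's parameters from the same data, the plain KS p-value is too lenient, which is what the Lilliefors correction addresses:

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(1)
sample = rng.normal(loc=0, scale=1, size=300)

# Compare the sample's empirical CDF to the standard normal CDF
res_norm = kstest(sample, 'norm', args=(0, 1))
print(f"normal sample vs N(0,1):  D = {res_norm.statistic:.3f}, p = {res_norm.pvalue:.3f}")

# A uniform sample clearly does not fit a standard normal
uniform_sample = rng.uniform(-3, 3, 300)
res_unif = kstest(uniform_sample, 'norm')
print(f"uniform sample vs N(0,1): D = {res_unif.statistic:.3f}, p = {res_unif.pvalue:.2e}")
```

The uniform sample produces a large gap against the normal CDF, so its fit is rejected.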
5
Intermediate: Interpreting KS Test Results
🤔 Before reading on: does a high p-value mean the samples are definitely from the same distribution? Commit to yes or no.
Concept: Understand how to read the test statistic and p-value to make decisions.
The test statistic D shows the biggest difference between CDFs. The p-value tells how likely it is to see such a difference if samples were from the same distribution. A low p-value (usually below 0.05) means the samples differ significantly.
Result
You can decide whether to reject, or fail to reject, the hypothesis that the samples come from the same distribution.
Knowing the meaning of p-values prevents wrong conclusions about data similarity.
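The decision rule above as a minimal sketch (the 0.05 threshold is the conventional choice mentioned in the text; the exponential samples are my own illustrative data):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
data1 = rng.exponential(scale=1.0, size=400)
data2 = rng.exponential(scale=2.0, size=400)

result = ks_2samp(data1, data2)
alpha = 0.05
if result.pvalue < alpha:
    print(f"Reject: samples likely differ (D = {result.statistic:.3f}, p = {result.pvalue:.2e})")
else:
    # Careful wording: this is absence of evidence, not proof of sameness
    print(f"Fail to reject: no evidence of a difference (p = {result.pvalue:.3f})")
```

Note the wording in the second branch: failing to reject is not the same as proving the distributions identical.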
6
Advanced: Limitations and Assumptions of the KS Test
🤔 Before reading on: does the KS test work well with small sample sizes? Commit to yes or no.
Concept: Learn when the KS test might give misleading results or fail to detect differences.
The KS test assumes continuous data and can be less powerful with small samples or discrete data. It is sensitive to differences near the center of distributions but less so at tails. Also, it requires independent samples.
Result
You understand when the test results might be unreliable or need caution.
Recognizing limitations helps choose the right test and interpret results correctly.
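A sketch of the small-sample power problem (seed, shift size, and sample sizes are my own choices; the exact p-values depend on the draw):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)

# The same modest 0.5 shift, tested at two very different sample sizes
for n in (10, 1000):
    a = rng.normal(0.0, 1, n)
    b = rng.normal(0.5, 1, n)
    p = ks_2samp(a, b).pvalue
    print(f"n = {n:5d}: p = {p:.4f}")
```

With only 10 points per group the test usually lacks the power to detect this shift, while at 1000 points the same difference is flagged decisively.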
7
Expert: The KS Test in Multidimensional and Large Data
🤔 Before reading on: do you think the KS test directly applies to multidimensional data? Commit to yes or no.
Concept: Explore challenges and adaptations of the KS test beyond simple one-dimensional data.
The classic KS test works only for one-dimensional data. For multidimensional data, extensions or other tests are needed. Also, with very large datasets, small differences can become statistically significant but practically irrelevant, requiring careful interpretation.
Result
You gain awareness of the test's scope and how to handle complex data scenarios.
Understanding these nuances prevents misuse in advanced data science tasks.
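One common workaround, sketched below, is to run a one-dimensional KS test per coordinate. This is a heuristic of mine, not a true multivariate KS test: it only sees differences in the marginals and is blind to differences in correlation structure. The Bonferroni correction shown is my own choice:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(5)
X = rng.normal(0, 1, size=(300, 3))  # 3-dimensional sample 1
Y = rng.normal(0, 1, size=(300, 3))  # 3-dimensional sample 2

# Run a 1-D KS test per coordinate; Bonferroni-correct the threshold
n_dims = X.shape[1]
pvalues = [ks_2samp(X[:, d], Y[:, d]).pvalue for d in range(n_dims)]
alpha = 0.05 / n_dims
print("per-dimension p-values:", np.round(pvalues, 3))
print("any marginal differs:", any(p < alpha for p in pvalues))
```

Two datasets with identical marginals but different correlations would pass this check while still differing, which is exactly why genuinely multivariate tests (Energy distance, MMD) exist.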
Under the Hood
The KS test calculates empirical CDFs for each sample by sorting data and computing cumulative probabilities. It then finds the maximum vertical distance (D) between these CDFs. Using the distribution of D under the null hypothesis, it computes a p-value. Internally, it relies on the Kolmogorov distribution to assess significance.
Why designed this way?
The test was designed to be nonparametric, meaning it does not assume a specific distribution shape, making it widely applicable. It focuses on the maximum difference to capture the largest deviation between samples, which is a simple yet powerful measure. Alternatives like chi-square require binning and lose information, so KS offers a more precise comparison.
Data Sample 1 ──> Sort ──> Empirical CDF 1 ──┐
                                             ├──> D = max |CDF1(x) − CDF2(x)|
Data Sample 2 ──> Sort ──> Empirical CDF 2 ──┘

D ──> Kolmogorov distribution ──> p-value
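The D computation described above can be reproduced by hand and checked against SciPy (the data and variable names are my own; this sketch assumes continuous data with no ties):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1, 200)
y = rng.normal(0.5, 1, 200)

# Evaluate both empirical CDFs at every observed point; since the ECDFs are
# step functions, the maximum gap must occur at one of these points
grid = np.sort(np.concatenate([x, y]))
cdf_x = np.searchsorted(np.sort(x), grid, side="right") / len(x)
cdf_y = np.searchsorted(np.sort(y), grid, side="right") / len(y)
D = np.max(np.abs(cdf_x - cdf_y))

print(f"hand-computed D: {D:.4f}")
print(f"scipy's D:       {ks_2samp(x, y).statistic:.4f}")
```

The two values agree, confirming that `ks_2samp` is doing exactly the sort-then-compare computation described above.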
Myth Busters - 4 Common Misconceptions
Quick: Does a high p-value prove two samples come from the same distribution? Commit to yes or no.
Common Belief: A high p-value means the two samples definitely come from the same distribution.
Reality: A high p-value means there is not enough evidence to say they differ, but it does not prove they are the same.
Why it matters: Misinterpreting p-values can lead to false confidence and wrong conclusions about data similarity.
Quick: Can the KS test be used directly on categorical data? Commit to yes or no.
Common Belief: The KS test works on any type of data, including categories.
Reality: The KS test requires continuous or ordinal data because it compares cumulative distributions; it is not suitable for purely categorical data.
Why it matters: Using KS on categorical data leads to invalid results and misinformed decisions.
Quick: Does the KS test detect differences equally well at all parts of the distribution? Commit to yes or no.
Common Belief: The KS test is equally sensitive to differences anywhere in the distribution.
Reality: The KS test is most sensitive near the center of the distribution and less sensitive at the tails.
Why it matters: Important differences in the tails might be missed, affecting analyses where extremes matter.
Quick: Can the KS test be applied directly to multidimensional data? Commit to yes or no.
Common Belief: The KS test can be used as is on multidimensional datasets.
Reality: The KS test is designed for one-dimensional data; multidimensional data require other methods or adaptations.
Why it matters: Applying KS directly to multidimensional data can produce misleading results.
Expert Zone
1
The KS test's sensitivity depends on sample size; very large samples can detect trivial differences that are not practically important.
2
When samples have ties (duplicate values), the KS test assumptions weaken, requiring careful interpretation or alternative tests.
3
The test statistic D is distribution-free under the null hypothesis, which means its distribution does not depend on the underlying continuous distribution.
When NOT to use
Avoid the KS test for discrete or categorical data, small sample sizes where power is low, or multidimensional data where it does not apply. Use alternatives like the Chi-square test for categorical data, Anderson-Darling test for more tail sensitivity, or multivariate tests like the Energy distance or MMD for multidimensional data.
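To contrast the tail-sensitivity point, the sketch below runs both KS and the Anderson-Darling k-sample test on a normal sample versus a heavy-tailed one (data and seed are my own choices; whether each test flags the difference depends on the draw, and SciPy caps `anderson_ksamp`'s reported significance level to the range [0.001, 0.25]):

```python
import numpy as np
from scipy.stats import anderson_ksamp, ks_2samp

rng = np.random.default_rng(11)
# Two samples that differ mainly in the tails: normal vs. heavy-tailed t
a = rng.normal(0, 1, 500)
b = rng.standard_t(df=2, size=500)

ks_p = ks_2samp(a, b).pvalue
ad = anderson_ksamp([a, b])

print(f"KS p-value:                  {ks_p:.4f}")
print(f"Anderson-Darling statistic:  {ad.statistic:.2f} "
      f"(capped significance level: {ad.significance_level:.3f})")
```

Because Anderson-Darling weights discrepancies in the tails more heavily, it tends to flag this kind of difference more readily than KS.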
Production Patterns
In practice, the KS test is used for validating simulation outputs, checking model residuals for normality, comparing experimental groups in A/B testing, and verifying assumptions before applying parametric tests. It is often combined with visual tools like Q-Q plots for robust analysis.
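A typical residual-check sketch (the residuals here are synthetic stand-ins for a fitted model's output; names are my own):

```python
import numpy as np
from scipy.stats import kstest

# Hypothetical residuals from some fitted model
rng = np.random.default_rng(21)
residuals = rng.normal(0, 2, 250)

# Caveat: estimating loc/scale from the same data makes the plain KS
# p-value too lenient (the Lilliefors correction addresses this)
loc, scale = residuals.mean(), residuals.std(ddof=1)
result = kstest(residuals, 'norm', args=(loc, scale))
print(f"D = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```

In practice, pair this with a Q-Q plot of the residuals rather than relying on the p-value alone.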
Connections
Hypothesis Testing
The KS test is a specific example of hypothesis testing focused on distribution comparison.
Understanding KS deepens grasp of hypothesis testing by showing how to test entire distributions, not just means or proportions.
Cumulative Distribution Function (CDF)
The KS test directly compares empirical CDFs of samples or against theoretical CDFs.
Mastering CDFs is essential to understand how KS measures differences between data patterns.
Quality Control in Manufacturing
KS test principles apply to checking if product measurements match expected standards.
Seeing KS as a tool for quality control shows its practical impact beyond statistics, ensuring products meet specifications.
Common Pitfalls
#1: Using the KS test on categorical data.
Wrong approach:
    from scipy.stats import ks_2samp
    sample1 = ['red', 'blue', 'red']
    sample2 = ['blue', 'green', 'blue']
    ks_2samp(sample1, sample2)  # invalid: categorical labels have no order
Correct approach: Use a Chi-square test for categorical data:
    from scipy.stats import chi2_contingency
    contingency_table = [[2, 1], [1, 2]]
    chi2_contingency(contingency_table)
Root cause: Not realizing that the KS test requires ordered or continuous data.
#2: Interpreting a high p-value as proof of identical distributions.
Wrong approach:
    result = ks_2samp(data1, data2)
    if result.pvalue > 0.05:
        print('Samples are the same')
Correct approach:
    result = ks_2samp(data1, data2)
    if result.pvalue > 0.05:
        print('No evidence to reject similarity, but not proof of identical distributions')
Root cause: Confusing 'fail to reject' with 'accept' in hypothesis testing.
#3: Applying the KS test directly to multidimensional data.
Wrong approach:
    ks_2samp(multidim_data1, multidim_data2)
Correct approach: Use multivariate two-sample tests such as the Energy distance test or Maximum Mean Discrepancy (MMD).
Root cause: Not recognizing the KS test's limitation to one-dimensional data.
Key Takeaways
The Kolmogorov-Smirnov test compares the maximum difference between cumulative distributions to assess similarity.
It is a nonparametric test that does not assume a specific distribution shape, making it flexible across many continuous-data settings.
Interpreting the test requires understanding p-values as evidence strength, not proof of sameness.
The test works best with continuous, one-dimensional data and has limitations with small samples or multidimensional data.
Knowing when and how to use the KS test prevents common mistakes and supports better data-driven decisions.