
Kolmogorov-Smirnov test in SciPy - Deep Dive

Overview - Kolmogorov-Smirnov test
What is it?
The Kolmogorov-Smirnov (KS) test is a way to check whether two sets of numbers could have come from the same distribution. It compares the cumulative shapes of the data to see if they match closely or differ. Because the test does not assume any specific type of distribution, it is very flexible. It helps decide whether two samples are similar, or whether one sample fits a known distribution.
Why it matters
Without this test, we might wrongly assume two data sets are alike or that data fits a certain pattern, leading to bad decisions. For example, in medicine, it helps check if a new treatment's results differ from usual outcomes. Without it, we could miss important differences or similarities, affecting research and real-world choices.
Where it fits
Before learning this, you should understand basic statistics like distributions and hypothesis testing. After this, you can explore other goodness-of-fit tests or advanced statistical comparisons. It fits in the journey after learning about data distributions and before deep statistical modeling.
Mental Model
Core Idea
The Kolmogorov-Smirnov test measures the biggest gap between two data patterns to decide if they come from the same source.
Think of it like...
Imagine two runners on a track starting together but running at different speeds. The test looks at the biggest distance between them during the race to see if they are running similarly or not.
1 ─┤                   ┌───────── Sample 1 CDF
   │             ┌─────┘
   │             │ ↕ D
   │        ┌────┘     ┌───────── Sample 2 CDF
   │        │     ┌────┘
   │   ┌────┘     │
0 ─┤───┘──────────┘

D = max vertical distance between these two curves
Build-Up - 7 Steps
1
Foundation: Understanding Data Distributions
🤔
Concept: Learn what a data distribution is and how data points spread in a sample.
A data distribution shows how often different values appear in a dataset. For example, test scores might cluster around a middle value or spread out evenly. Visual tools like histograms or cumulative distribution functions (CDFs) help us see this spread.
Result
You can describe how data values are arranged and recognize patterns like clustering or spread.
Understanding distributions is key because the Kolmogorov-Smirnov test compares these patterns between datasets.
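As a quick illustration of "seeing the spread" (the sample data, seed, and bin count below are arbitrary choices of mine, not from the text), a histogram counts how many values fall into each interval:

```python
import numpy as np

# Illustrative sample: 1000 draws from a bell-shaped distribution
rng = np.random.default_rng(42)
scores = rng.normal(loc=70, scale=10, size=1000)

# Count how many values fall into each of 10 equal-width bins
counts, bin_edges = np.histogram(scores, bins=10)
for count, left, right in zip(counts, bin_edges[:-1], bin_edges[1:]):
    print(f"{left:6.1f} - {right:6.1f}: {'#' * (count // 20)}")
```

Most bars cluster around 70, showing the central tendency and spread described above.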
2
Foundation: What is a Cumulative Distribution Function?
🤔
Concept: Introduce the CDF as a way to summarize data distribution by showing cumulative probabilities.
A CDF tells us the chance that a data point is less than or equal to a certain value. It starts at 0 and rises to 1 as we move through the data range. Plotting the CDF gives a smooth curve representing the data's distribution.
Result
You can create and interpret CDFs to understand data spread in a cumulative way.
The Kolmogorov-Smirnov test uses CDFs to compare datasets, so knowing how to read them is essential.
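A minimal sketch of how an empirical CDF can be built by hand (the function and variable names here are my own, not a SciPy API):

```python
import numpy as np

def empirical_cdf(sample, x):
    """Fraction of sample values less than or equal to x."""
    sample = np.sort(np.asarray(sample))
    # searchsorted(..., side="right") counts values <= x in the sorted sample
    return np.searchsorted(sample, x, side="right") / len(sample)

data = [3, 1, 4, 1, 5, 9, 2, 6]
print(empirical_cdf(data, 1))   # 2 of 8 values are <= 1  -> 0.25
print(empirical_cdf(data, 4))   # 5 of 8 values are <= 4  -> 0.625
print(empirical_cdf(data, 10))  # all values are <= 10    -> 1.0
```

Note how the function starts near 0 for small x and climbs to 1, exactly as described above.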
3
Intermediate: Comparing Two Samples with the KS Test
🤔 Before reading on: do you think the KS test compares averages or the entire data shape? Commit to your answer.
Concept: The KS test compares the full shape of two data distributions, not just averages or medians.
The test calculates the maximum vertical distance (D) between the two samples' CDFs. It then uses this distance to decide if the samples likely come from the same distribution. A small D means they are similar; a large D means they differ.
Result
You get a test statistic (D) and a p-value indicating similarity between samples.
Knowing the test compares entire distributions helps avoid mistakes like focusing only on averages.
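A minimal sketch using SciPy's `ks_2samp` (the sample sizes, seed, and the 1-unit shift are my own illustrative choices):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
same_a = rng.normal(0, 1, 500)
same_b = rng.normal(0, 1, 500)   # same distribution, different draws
shifted = rng.normal(1, 1, 500)  # same shape, shifted by 1

res_same = ks_2samp(same_a, same_b)
res_diff = ks_2samp(same_a, shifted)

# A small D suggests similar distributions; a large D suggests they differ
print(f"same source:    D = {res_same.statistic:.3f}, p = {res_same.pvalue:.3f}")
print(f"shifted source: D = {res_diff.statistic:.3f}, p = {res_diff.pvalue:.2e}")
```

The shifted pair yields a much larger D and a far smaller p-value, even though the two distributions have the same shape and spread.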
4
Intermediate: Using the KS Test for One Sample vs. a Known Distribution
🤔 Before reading on: can the KS test check if one sample fits a known distribution? Commit to yes or no.
Concept: The KS test can check if a sample matches a specific known distribution, like normal or uniform.
You compare the sample's CDF to the theoretical CDF of the known distribution. The test measures the biggest gap and computes a p-value to judge whether the fit should be rejected.
Result
You learn if your data likely follows the chosen theoretical distribution.
This use helps validate assumptions about data, which is crucial before applying many statistical methods.
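The one-sample form is SciPy's `kstest`. Below is a sketch (seed and sample sizes are my own choices); note that if you estimate the distribution's parameters from the same data, the plain KS p-value is too lenient, which is what the Lilliefors correction addresses:

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(1)
sample = rng.normal(loc=0, scale=1, size=300)

# Compare the sample's empirical CDF to the standard normal CDF
res_norm = kstest(sample, 'norm', args=(0, 1))
print(f"normal sample vs N(0,1):  D = {res_norm.statistic:.3f}, p = {res_norm.pvalue:.3f}")

# A uniform sample clearly does not fit a standard normal
uniform_sample = rng.uniform(-3, 3, 300)
res_unif = kstest(uniform_sample, 'norm')
print(f"uniform sample vs N(0,1): D = {res_unif.statistic:.3f}, p = {res_unif.pvalue:.2e}")
```

The uniform sample produces a large gap against the normal CDF, so its fit is rejected.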
5
Intermediate: Interpreting KS Test Results
🤔 Before reading on: does a high p-value mean the samples are definitely from the same distribution? Commit to yes or no.
Concept: Understand how to read the test statistic and p-value to make decisions.
The test statistic D shows the biggest difference between CDFs. The p-value tells how likely it is to see such a difference if samples were from the same distribution. A low p-value (usually below 0.05) means the samples differ significantly.
Result
You can decide whether to reject, or fail to reject, the hypothesis that the samples come from the same distribution.
Knowing the meaning of p-values prevents wrong conclusions about data similarity.
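The decision rule above as a minimal sketch (the 0.05 threshold is the conventional choice mentioned in the text; the exponential samples are my own illustrative data):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
data1 = rng.exponential(scale=1.0, size=400)
data2 = rng.exponential(scale=2.0, size=400)

result = ks_2samp(data1, data2)
alpha = 0.05
if result.pvalue < alpha:
    print(f"Reject: samples likely differ (D = {result.statistic:.3f}, p = {result.pvalue:.2e})")
else:
    # Careful wording: this is absence of evidence, not proof of sameness
    print(f"Fail to reject: no evidence of a difference (p = {result.pvalue:.3f})")
```

Note the wording in the second branch: failing to reject is not the same as proving the distributions identical.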
6
Advanced: Limitations and Assumptions of the KS Test
🤔 Before reading on: does the KS test work well with small sample sizes? Commit to yes or no.
Concept: Learn when the KS test might give misleading results or fail to detect differences.
The KS test assumes continuous data and can be less powerful with small samples or discrete data. It is sensitive to differences near the center of distributions but less so at tails. Also, it requires independent samples.
Result
You understand when the test results might be unreliable or need caution.
Recognizing limitations helps choose the right test and interpret results correctly.
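A sketch of the small-sample power problem (seed, shift size, and sample sizes are my own choices; the exact p-values depend on the draw):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)

# The same modest 0.5 shift, tested at two very different sample sizes
for n in (10, 1000):
    a = rng.normal(0.0, 1, n)
    b = rng.normal(0.5, 1, n)
    p = ks_2samp(a, b).pvalue
    print(f"n = {n:5d}: p = {p:.4f}")
```

With only 10 points per group the test usually lacks the power to detect this shift, while at 1000 points the same difference is flagged decisively.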
7
Expert: The KS Test in Multidimensional and Large Data
🤔 Before reading on: do you think the KS test directly applies to multidimensional data? Commit to yes or no.
Concept: Explore challenges and adaptations of the KS test beyond simple one-dimensional data.
The classic KS test works only for one-dimensional data. For multidimensional data, extensions or other tests are needed. Also, with very large datasets, small differences can become statistically significant but practically irrelevant, requiring careful interpretation.
Result
You gain awareness of the test's scope and how to handle complex data scenarios.
Understanding these nuances prevents misuse in advanced data science tasks.
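One common workaround, sketched below, is to run a one-dimensional KS test per coordinate. This is a heuristic of mine, not a true multivariate KS test: it only sees differences in the marginals and is blind to differences in correlation structure. The Bonferroni correction shown is my own choice:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(5)
X = rng.normal(0, 1, size=(300, 3))  # 3-dimensional sample 1
Y = rng.normal(0, 1, size=(300, 3))  # 3-dimensional sample 2

# Run a 1-D KS test per coordinate; Bonferroni-correct the threshold
n_dims = X.shape[1]
pvalues = [ks_2samp(X[:, d], Y[:, d]).pvalue for d in range(n_dims)]
alpha = 0.05 / n_dims
print("per-dimension p-values:", np.round(pvalues, 3))
print("any marginal differs:", any(p < alpha for p in pvalues))
```

Two datasets with identical marginals but different correlations would pass this check while still differing, which is exactly why genuinely multivariate tests (Energy distance, MMD) exist.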
Under the Hood
The KS test calculates empirical CDFs for each sample by sorting data and computing cumulative probabilities. It then finds the maximum vertical distance (D) between these CDFs. Using the distribution of D under the null hypothesis, it computes a p-value. Internally, it relies on the Kolmogorov distribution to assess significance.
Why designed this way?
The test was designed to be nonparametric, meaning it does not assume a specific distribution shape, making it widely applicable. It focuses on the maximum difference to capture the largest deviation between samples, which is a simple yet powerful measure. Alternatives like chi-square require binning and lose information, so KS offers a more precise comparison.
Data Sample 1 ──> Sort ──> Empirical CDF 1 ──┐
                                             ├──> D = max |CDF1(x) − CDF2(x)|
Data Sample 2 ──> Sort ──> Empirical CDF 2 ──┘

D ──> Kolmogorov distribution ──> p-value
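The D computation described above can be reproduced by hand and checked against SciPy (the data and variable names are my own; this sketch assumes continuous data with no ties):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1, 200)
y = rng.normal(0.5, 1, 200)

# Evaluate both empirical CDFs at every observed point; since the ECDFs are
# step functions, the maximum gap must occur at one of these points
grid = np.sort(np.concatenate([x, y]))
cdf_x = np.searchsorted(np.sort(x), grid, side="right") / len(x)
cdf_y = np.searchsorted(np.sort(y), grid, side="right") / len(y)
D = np.max(np.abs(cdf_x - cdf_y))

print(f"hand-computed D: {D:.4f}")
print(f"scipy's D:       {ks_2samp(x, y).statistic:.4f}")
```

The two values agree, confirming that `ks_2samp` is doing exactly the sort-then-compare computation described above.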
Myth Busters - 4 Common Misconceptions
Quick: Does a high p-value prove two samples come from the same distribution? Commit to yes or no.
Common Belief: A high p-value means the two samples definitely come from the same distribution.
Reality: A high p-value means there is not enough evidence to say they differ, but it does not prove they are the same.
Why it matters: Misinterpreting p-values can lead to false confidence and wrong conclusions about data similarity.
Quick: Can the KS test be used directly on categorical data? Commit to yes or no.
Common Belief: The KS test works on any type of data, including categories.
Reality: The KS test requires continuous or ordinal data because it compares cumulative distributions; it is not suitable for purely categorical data.
Why it matters: Using KS on categorical data leads to invalid results and misinformed decisions.
Quick: Does the KS test detect differences equally well at all parts of the distribution? Commit to yes or no.
Common Belief: The KS test is equally sensitive to differences anywhere in the distribution.
Reality: The KS test is most sensitive near the center of the distribution and less sensitive at the tails.
Why it matters: Important differences in the tails might be missed, affecting analyses where extremes matter.
Quick: Can the KS test be applied directly to multidimensional data? Commit to yes or no.
Common Belief: The KS test can be used as is on multidimensional datasets.
Reality: The KS test is designed for one-dimensional data; multidimensional data require other methods or adaptations.
Why it matters: Applying KS directly to multidimensional data can produce misleading results.
Expert Zone
1
The KS test's sensitivity depends on sample size; very large samples can detect trivial differences that are not practically important.
2
When samples have ties (duplicate values), the KS test assumptions weaken, requiring careful interpretation or alternative tests.
3
The test statistic D is distribution-free under the null hypothesis, which means its distribution does not depend on the underlying continuous distribution.
When NOT to use
Avoid the KS test for discrete or categorical data, small sample sizes where power is low, or multidimensional data where it does not apply. Use alternatives like the Chi-square test for categorical data, Anderson-Darling test for more tail sensitivity, or multivariate tests like the Energy distance or MMD for multidimensional data.
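To contrast the tail-sensitivity point, the sketch below runs both KS and the Anderson-Darling k-sample test on a normal sample versus a heavy-tailed one (data and seed are my own choices; whether each test flags the difference depends on the draw, and SciPy caps `anderson_ksamp`'s reported significance level to the range [0.001, 0.25]):

```python
import numpy as np
from scipy.stats import anderson_ksamp, ks_2samp

rng = np.random.default_rng(11)
# Two samples that differ mainly in the tails: normal vs. heavy-tailed t
a = rng.normal(0, 1, 500)
b = rng.standard_t(df=2, size=500)

ks_p = ks_2samp(a, b).pvalue
ad = anderson_ksamp([a, b])

print(f"KS p-value:                  {ks_p:.4f}")
print(f"Anderson-Darling statistic:  {ad.statistic:.2f} "
      f"(capped significance level: {ad.significance_level:.3f})")
```

Because Anderson-Darling weights discrepancies in the tails more heavily, it tends to flag this kind of difference more readily than KS.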
Production Patterns
In practice, the KS test is used for validating simulation outputs, checking model residuals for normality, comparing experimental groups in A/B testing, and verifying assumptions before applying parametric tests. It is often combined with visual tools like Q-Q plots for robust analysis.
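A typical residual-check sketch (the residuals here are synthetic stand-ins for a fitted model's output; names are my own):

```python
import numpy as np
from scipy.stats import kstest

# Hypothetical residuals from some fitted model
rng = np.random.default_rng(21)
residuals = rng.normal(0, 2, 250)

# Caveat: estimating loc/scale from the same data makes the plain KS
# p-value too lenient (the Lilliefors correction addresses this)
loc, scale = residuals.mean(), residuals.std(ddof=1)
result = kstest(residuals, 'norm', args=(loc, scale))
print(f"D = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```

In practice, pair this with a Q-Q plot of the residuals rather than relying on the p-value alone.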
Connections
Hypothesis Testing
The KS test is a specific example of hypothesis testing focused on distribution comparison.
Understanding KS deepens grasp of hypothesis testing by showing how to test entire distributions, not just means or proportions.
Cumulative Distribution Function (CDF)
The KS test directly compares empirical CDFs of samples or against theoretical CDFs.
Mastering CDFs is essential to understand how KS measures differences between data patterns.
Quality Control in Manufacturing
KS test principles apply to checking if product measurements match expected standards.
Seeing KS as a tool for quality control shows its practical impact beyond statistics, ensuring products meet specifications.
Common Pitfalls
#1: Using the KS test on categorical data.
Wrong approach:
    from scipy.stats import ks_2samp
    sample1 = ['red', 'blue', 'red']
    sample2 = ['blue', 'green', 'blue']
    ks_2samp(sample1, sample2)  # invalid: categorical labels have no order
Correct approach: Use a Chi-square test for categorical data:
    from scipy.stats import chi2_contingency
    contingency_table = [[2, 1], [1, 2]]
    chi2_contingency(contingency_table)
Root cause: Not realizing that the KS test requires ordered or continuous data.
#2: Interpreting a high p-value as proof of identical distributions.
Wrong approach:
    result = ks_2samp(data1, data2)
    if result.pvalue > 0.05:
        print('Samples are the same')
Correct approach:
    result = ks_2samp(data1, data2)
    if result.pvalue > 0.05:
        print('No evidence to reject similarity, but not proof of identical distributions')
Root cause: Confusing 'fail to reject' with 'accept' in hypothesis testing.
#3: Applying the KS test directly to multidimensional data.
Wrong approach:
    ks_2samp(multidim_data1, multidim_data2)
Correct approach: Use multivariate two-sample tests such as the Energy distance test or Maximum Mean Discrepancy (MMD).
Root cause: Not recognizing the KS test's limitation to one-dimensional data.
Key Takeaways
The Kolmogorov-Smirnov test compares the maximum difference between cumulative distributions to assess similarity.
It is a nonparametric test that does not assume a specific distribution shape, making it flexible across many continuous-data settings.
Interpreting the test requires understanding p-values as evidence strength, not proof of sameness.
The test works best with continuous, one-dimensional data and has limitations with small samples or multidimensional data.
Knowing when and how to use the KS test prevents common mistakes and supports better data-driven decisions.