How to Detect Data Drift in Machine Learning Models
To detect data drift, compare the statistical properties of new incoming data against the original training data using tests such as Kolmogorov-Smirnov or Chi-Square. Monitoring model input features and output predictions over time helps identify when data changes affect model performance.

Syntax
Data drift detection typically involves these steps:
- Collect baseline data: Original training data statistics.
- Collect new data: Incoming data to compare.
- Choose a statistical test: For example, Kolmogorov-Smirnov for continuous data or Chi-Square for categorical data.
- Run the test: Compare distributions to detect significant differences.
- Interpret results: A low p-value indicates data drift.
```python
from scipy.stats import ks_2samp

# baseline_data and new_data are arrays of feature values
statistic, p_value = ks_2samp(baseline_data, new_data)

if p_value < 0.05:
    print('Data drift detected')
else:
    print('No data drift detected')
```
Example
This example shows how to detect data drift on a numeric feature using the Kolmogorov-Smirnov test. It compares the original training data distribution with new incoming data.
```python
import numpy as np
from scipy.stats import ks_2samp

# Simulate baseline training data
baseline_data = np.random.normal(loc=0, scale=1, size=1000)

# Simulate new data with drift (shifted mean)
new_data = np.random.normal(loc=0.5, scale=1, size=1000)

# Perform KS test
statistic, p_value = ks_2samp(baseline_data, new_data)

print(f"KS statistic: {statistic:.3f}")
print(f"p-value: {p_value:.3f}")

if p_value < 0.05:
    print("Data drift detected")
else:
    print("No data drift detected")
```
Output
KS statistic: 0.253
p-value: 0.000
Data drift detected
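In practice you usually monitor several features at once rather than a single array. The sketch below extends the same KS check to a per-feature loop; the feature names, distributions, and sample sizes are made up for illustration:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Hypothetical baseline and incoming samples, keyed by feature name
baseline = {
    "age": rng.normal(40, 10, 2000),
    "income": rng.normal(50_000, 8_000, 2000),
}
new = {
    "age": rng.normal(40, 10, 2000),            # same distribution
    "income": rng.normal(55_000, 8_000, 2000),  # shifted mean -> drift
}

# Run the KS test per feature and flag any p-value below the threshold
for feature in baseline:
    statistic, p_value = ks_2samp(baseline[feature], new[feature])
    status = "drift" if p_value < 0.05 else "ok"
    print(f"{feature}: KS={statistic:.3f}, p={p_value:.4f} ({status})")
```

A loop like this makes it easy to log one p-value per feature per monitoring window and alert only on the features that moved.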
Common Pitfalls
- Ignoring feature types: Use appropriate tests for numeric vs categorical data.
- Small sample sizes: Can lead to unreliable test results.
- Not monitoring model outputs: Data drift may not always affect inputs but can impact predictions.
- Overreacting to minor changes: Not all differences mean harmful drift; consider business context.
```python
from collections import Counter
from scipy.stats import chi2_contingency

# Wrong: using the KS test on categorical (string) data
categorical_baseline = ['red', 'blue', 'red', 'green']
categorical_new = ['red', 'red', 'blue', 'blue']
# ks_2samp on these lists would error or give meaningless results

# Correct approach: build a contingency table and run a Chi-Square test
baseline_counts = Counter(categorical_baseline)
new_counts = Counter(categorical_new)

# One row per category, one column per dataset
categories = list(set(baseline_counts) | set(new_counts))
contingency_table = [
    [baseline_counts.get(cat, 0), new_counts.get(cat, 0)]
    for cat in categories
]

chi2, p, dof, expected = chi2_contingency(contingency_table)

if p < 0.05:
    print('Data drift detected in categorical feature')
else:
    print('No data drift detected in categorical feature')
```
Output
No data drift detected in categorical feature

With only four observations per group, the Chi-Square test lacks the power to flag the shift (p is roughly 0.51 here), which itself illustrates the small-sample pitfall above.
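At realistic sample sizes, the same contingency-table approach does clearly flag a category shift. The counts below are made up for illustration:

```python
from scipy.stats import chi2_contingency

# Hypothetical category counts at production scale (made up for illustration)
baseline_counts = {"red": 500, "blue": 300, "green": 200}
new_counts = {"red": 300, "blue": 500, "green": 200}

# One row per category, one column per dataset
contingency_table = [
    [baseline_counts[cat], new_counts[cat]]
    for cat in ("red", "blue", "green")
]

chi2, p, dof, expected = chi2_contingency(contingency_table)
print(f"chi2={chi2:.1f}, p={p:.2e}")

if p < 0.05:
    print("Data drift detected in categorical feature")
else:
    print("No data drift detected in categorical feature")
```

Here the swap between red and blue produces a chi-square statistic of 100 with a vanishingly small p-value, so drift is detected.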
Quick Reference
Tips for detecting data drift:
- Use the Kolmogorov-Smirnov test for continuous numeric features.
- Use the Chi-Square test for categorical features.
- Monitor model prediction distributions over time.
- Set thresholds for p-values (commonly 0.05) to flag drift.
- Combine statistical tests with business knowledge for best results.
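The first two rules above can be combined into a small dispatcher that picks the test based on the feature's dtype. `detect_drift` is a hypothetical helper written for this sketch, not a library function:

```python
import numpy as np
from scipy.stats import ks_2samp, chi2_contingency

def detect_drift(baseline, new, alpha=0.05):
    """Hypothetical helper: KS test for numeric features,
    Chi-Square for categorical ones."""
    baseline, new = np.asarray(baseline), np.asarray(new)
    if np.issubdtype(baseline.dtype, np.number):
        _, p_value = ks_2samp(baseline, new)
    else:
        # Build a contingency table over the union of categories
        categories = sorted(set(baseline) | set(new))
        table = [
            [int(np.sum(baseline == cat)), int(np.sum(new == cat))]
            for cat in categories
        ]
        _, p_value, _, _ = chi2_contingency(table)
    return p_value < alpha, p_value

rng = np.random.default_rng(0)
drifted, p = detect_drift(rng.normal(0, 1, 1000), rng.normal(0.5, 1, 1000))
print(f"numeric feature drifted: {drifted} (p={p:.2e})")
```

The same call works for categorical features, e.g. `detect_drift(['red'] * 500 + ['blue'] * 300, ['red'] * 300 + ['blue'] * 500)`, since string arrays fall through to the Chi-Square branch.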
Key Takeaways
- Use statistical tests like Kolmogorov-Smirnov or Chi-Square to compare new data with training data.
- Choose the right test based on feature type: numeric or categorical.
- A low p-value (e.g., below 0.05) usually signals data drift.
- Monitor both input features and model outputs to catch drift early.
- Avoid false alarms by considering sample size and business impact.