How to Detect Data Drift in Machine Learning Models
To detect data drift, compare the statistical properties of new incoming data against the original training data using tests such as Kolmogorov-Smirnov or Chi-Square. Monitoring model input features and output predictions over time helps identify when data changes affect model performance.

Syntax
Data drift detection typically involves these steps:
- Collect baseline data: Original training data statistics.
- Collect new data: Incoming data to compare.
- Choose a statistical test: For example, Kolmogorov-Smirnov for continuous data or Chi-Square for categorical data.
- Run the test: Compare distributions to detect significant differences.
- Interpret results: A low p-value indicates data drift.
```python
from scipy.stats import ks_2samp

# baseline_data and new_data are arrays of feature values
statistic, p_value = ks_2samp(baseline_data, new_data)

if p_value < 0.05:
    print('Data drift detected')
else:
    print('No data drift detected')
```
Example
This example shows how to detect data drift on a numeric feature using the Kolmogorov-Smirnov test. It compares the original training data distribution with new incoming data.
```python
import numpy as np
from scipy.stats import ks_2samp

# Simulate baseline training data
baseline_data = np.random.normal(loc=0, scale=1, size=1000)

# Simulate new data with drift (shifted mean)
new_data = np.random.normal(loc=0.5, scale=1, size=1000)

# Perform KS test
statistic, p_value = ks_2samp(baseline_data, new_data)

print(f"KS statistic: {statistic:.3f}")
print(f"p-value: {p_value:.3f}")

if p_value < 0.05:
    print("Data drift detected")
else:
    print("No data drift detected")
```
Output
KS statistic: 0.253
p-value: 0.000
Data drift detected
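In practice you usually monitor several features at once rather than a single array. The sketch below extends the same KS check to a per-feature loop; the feature names, distributions, and sample sizes are made up for illustration:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Hypothetical baseline and incoming samples, keyed by feature name
baseline = {
    "age": rng.normal(40, 10, 2000),
    "income": rng.normal(50_000, 8_000, 2000),
}
new = {
    "age": rng.normal(40, 10, 2000),            # same distribution
    "income": rng.normal(55_000, 8_000, 2000),  # shifted mean -> drift
}

# Run the KS test per feature and flag any p-value below the threshold
for feature in baseline:
    statistic, p_value = ks_2samp(baseline[feature], new[feature])
    status = "drift" if p_value < 0.05 else "ok"
    print(f"{feature}: KS={statistic:.3f}, p={p_value:.4f} ({status})")
```

A loop like this makes it easy to log one p-value per feature per monitoring window and alert only on the features that moved.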
Common Pitfalls
- Ignoring feature types: Use appropriate tests for numeric vs categorical data.
- Small sample sizes: Can lead to unreliable test results.
- Not monitoring model outputs: Data drift may not always affect inputs but can impact predictions.
- Overreacting to minor changes: Not all differences mean harmful drift; consider business context.
```python
from collections import Counter
from scipy.stats import chi2_contingency

# Wrong: using the KS test on categorical (string) data
categorical_baseline = ['red', 'blue', 'red', 'green']
categorical_new = ['red', 'red', 'blue', 'blue']
# ks_2samp on these lists would error or give meaningless results

# Correct approach: build a contingency table and run a Chi-Square test
baseline_counts = Counter(categorical_baseline)
new_counts = Counter(categorical_new)

# One row per category, one column per dataset
categories = list(set(baseline_counts) | set(new_counts))
contingency_table = [
    [baseline_counts.get(cat, 0), new_counts.get(cat, 0)]
    for cat in categories
]

chi2, p, dof, expected = chi2_contingency(contingency_table)

if p < 0.05:
    print('Data drift detected in categorical feature')
else:
    print('No data drift detected in categorical feature')
```
Output
No data drift detected in categorical feature

With only four observations per group, the Chi-Square test lacks the power to flag the shift (p is roughly 0.51 here), which itself illustrates the small-sample pitfall above.
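At realistic sample sizes, the same contingency-table approach does clearly flag a category shift. The counts below are made up for illustration:

```python
from scipy.stats import chi2_contingency

# Hypothetical category counts at production scale (made up for illustration)
baseline_counts = {"red": 500, "blue": 300, "green": 200}
new_counts = {"red": 300, "blue": 500, "green": 200}

# One row per category, one column per dataset
contingency_table = [
    [baseline_counts[cat], new_counts[cat]]
    for cat in ("red", "blue", "green")
]

chi2, p, dof, expected = chi2_contingency(contingency_table)
print(f"chi2={chi2:.1f}, p={p:.2e}")

if p < 0.05:
    print("Data drift detected in categorical feature")
else:
    print("No data drift detected in categorical feature")
```

Here the swap between red and blue produces a chi-square statistic of 100 with a vanishingly small p-value, so drift is detected.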
Quick Reference
Tips for detecting data drift:
- Use the Kolmogorov-Smirnov test for continuous numeric features.
- Use the Chi-Square test for categorical features.
- Monitor model prediction distributions over time.
- Set thresholds for p-values (commonly 0.05) to flag drift.
- Combine statistical tests with business knowledge for best results.
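The first two rules above can be combined into a small dispatcher that picks the test based on the feature's dtype. `detect_drift` is a hypothetical helper written for this sketch, not a library function:

```python
import numpy as np
from scipy.stats import ks_2samp, chi2_contingency

def detect_drift(baseline, new, alpha=0.05):
    """Hypothetical helper: KS test for numeric features,
    Chi-Square for categorical ones."""
    baseline, new = np.asarray(baseline), np.asarray(new)
    if np.issubdtype(baseline.dtype, np.number):
        _, p_value = ks_2samp(baseline, new)
    else:
        # Build a contingency table over the union of categories
        categories = sorted(set(baseline) | set(new))
        table = [
            [int(np.sum(baseline == cat)), int(np.sum(new == cat))]
            for cat in categories
        ]
        _, p_value, _, _ = chi2_contingency(table)
    return p_value < alpha, p_value

rng = np.random.default_rng(0)
drifted, p = detect_drift(rng.normal(0, 1, 1000), rng.normal(0.5, 1, 1000))
print(f"numeric feature drifted: {drifted} (p={p:.2e})")
```

The same call works for categorical features, e.g. `detect_drift(['red'] * 500 + ['blue'] * 300, ['red'] * 300 + ['blue'] * 500)`, since string arrays fall through to the Chi-Square branch.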
Key Takeaways
- Use statistical tests like Kolmogorov-Smirnov or Chi-Square to compare new data with training data.
- Choose the right test based on feature type: numeric or categorical.
- A low p-value (e.g., below 0.05) usually signals data drift.
- Monitor both input features and model outputs to catch drift early.
- Avoid false alarms by considering sample size and business impact.