Data drift detection basics in MLOps - Time & Space Complexity
When detecting data drift, we want to know how the time to check changes as data grows.
How does the cost of scanning data for drift grow with more data?
Analyze the time complexity of the following code snippet.
    # Simple data drift detection by comparing feature distributions
    for feature in dataset.features:
        baseline_dist = baseline_data[feature].distribution()
        current_dist = current_data[feature].distribution()
        drift_score = calculate_drift(baseline_dist, current_dist)
        if drift_score > threshold:
            alert_drift(feature)
This code checks each feature's distribution in new data against baseline data to find drift.
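The snippet above is pseudocode. A minimal runnable sketch follows, assuming each dataset is a plain dict that maps a feature name to a list of numbers. The drift score used here (the shift in the mean, measured in baseline standard deviations) is only a stand-in for calculate_drift, and the names mean_shift_score and find_drifted_features are hypothetical.

```python
from statistics import mean, pstdev


def mean_shift_score(baseline, current):
    """Stand-in drift score: mean shift in units of baseline std dev."""
    spread = pstdev(baseline)
    if spread == 0:
        # Constant baseline: any change in the mean counts as maximal drift.
        return 0.0 if mean(current) == mean(baseline) else float("inf")
    return abs(mean(current) - mean(baseline)) / spread


def find_drifted_features(baseline_data, current_data, threshold=1.0):
    """Return the features whose drift score exceeds the threshold."""
    drifted = []
    for feature in baseline_data:  # one drift check per feature
        score = mean_shift_score(baseline_data[feature], current_data[feature])
        if score > threshold:
            drifted.append(feature)
    return drifted


# Illustrative data: "age" is stable, "income" has shifted upward.
baseline_data = {"age": [30, 32, 31, 29, 33], "income": [50, 52, 51, 49, 53]}
current_data = {"age": [30, 31, 32, 30, 32], "income": [70, 72, 71, 69, 73]}
drifted = find_drifted_features(baseline_data, current_data)
print(drifted)
```

Returning a list instead of calling alert_drift directly keeps the check easy to test; the caller can decide how to raise alerts.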
- Primary operation: Looping over each feature in the dataset.
- How many times: Once per feature, so number of features (f).
As the number of features grows, the time to check drift grows linearly.
| Input Size (features) | Approx. Operations |
|---|---|
| 10 | 10 drift checks |
| 100 | 100 drift checks |
| 1000 | 1000 drift checks |
Pattern observation: Doubling features doubles the work, so growth is steady and linear.
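Time Complexity: O(f) drift checks
This means the number of drift checks grows directly with the number of features. Each check also has to read that feature's values to build its distribution, so unless the distributions are precomputed, the total work with n rows per feature is closer to O(f · n).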
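To make the growth pattern concrete, the sketch below counts the work done by a per-feature scan for several feature counts. The function count_work and the fixed rows_per_feature value are illustrative assumptions, not part of the original snippet.

```python
def count_work(num_features, rows_per_feature):
    """Count drift checks and values read by a per-feature drift scan."""
    checks = 0
    values_read = 0
    for _ in range(num_features):
        checks += 1                          # one drift check per feature
        values_read += 2 * rows_per_feature  # baseline and current samples
    return checks, values_read


for f in (10, 100, 1000):
    checks, values_read = count_work(f, rows_per_feature=500)
    print(f"{f} features -> {checks} checks, {values_read} values read")
```

Both counters scale in step with the number of features, which matches the table above.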
[X] Wrong: "Checking more features won't affect the time much because each check is fast."
[OK] Correct: Even if each check is quick, doing many checks adds up, so more features mean more total time.
Understanding how time grows with data features helps you design scalable monitoring systems in real projects.
"What if we added nested loops to compare every feature pair? How would the time complexity change?"
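One way to explore this question is to count the comparisons. The sketch below uses a hypothetical helper, count_pair_checks, to enumerate every unordered feature pair with itertools.combinations. The count works out to f(f-1)/2, so the work grows quadratically, O(f^2), rather than linearly.

```python
from itertools import combinations


def count_pair_checks(features):
    """Count comparisons when every unordered feature pair is checked."""
    return sum(1 for _ in combinations(features, 2))


for f in (10, 100, 1000):
    features = [f"feature_{i}" for i in range(f)]
    print(f, "features ->", count_pair_checks(features), "pair checks")
```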
Practice
data drift detection in machine learning?
Solution
Step 1: Understand data drift concept
Data drift detection is about monitoring if new incoming data changes compared to the data used to train the model.
Step 2: Identify the purpose
This helps ensure the model stays accurate by alerting when data changes too much.
Final Answer:
To check if new data differs significantly from the training data -> Option A
Quick Check:
Data drift = detecting data changes [OK]
- Confusing data drift with model training speed
- Thinking data drift reduces dataset size
- Believing data drift adds features
data_train and data_new?
Solution
Step 1: Identify correct import and function
The Kolmogorov-Smirnov test is in scipy.stats as ks_2samp.
Step 2: Check function usage
Calling ks_2samp(data_train, data_new) returns a result with a pvalue attribute.
Final Answer:
    from scipy.stats import ks_2samp
    result = ks_2samp(data_train, data_new)
    print(result.pvalue)
-> Option B
Quick Check:
Correct function and import = ks_2samp from scipy.stats, then read result.pvalue [OK]
- Using wrong module or function name
- Trying to import non-existent ks_test
- Confusing sklearn with scipy for this test
data_train = [1, 2, 3, 4, 5] and data_new = [1, 2, 3, 4, 10]?
    from scipy.stats import ks_2samp
    result = ks_2samp(data_train, data_new)
    print(round(result.pvalue, 2))
Solution
Step 1: Understand the test and data
The Kolmogorov-Smirnov test compares distributions. Here, only one value differs (5 vs 10), so the largest gap between the two empirical CDFs is 0.2, the smallest nonzero gap possible for two samples of five.
Step 2: Interpret p-value meaning
A high p-value (close to 1) means no significant difference; a low p-value suggests drift.
Final Answer:
1.0 -> Option A
Quick Check:
Small difference gives high p-value = 1.0 [OK]
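To see why the p-value comes out high, it can help to compute the test statistic by hand. The sketch below is a pure-Python illustration of the two-sample KS statistic, the largest gap between the two empirical CDFs. The helper ks_statistic is hypothetical and is not SciPy's implementation.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in set(a) | set(b):  # the ECDFs only change at observed values
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


data_train = [1, 2, 3, 4, 5]
data_new = [1, 2, 3, 4, 10]
d = ks_statistic(data_train, data_new)
print(round(d, 2))
```

With five values per sample, each empirical CDF moves in steps of 0.2, so a gap of 0.2 is the smallest nonzero difference these samples could show. That leaves the test with no evidence of drift.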
- Assuming any difference means low p-value
- Confusing p-value with test statistic
- Rounding errors in output
AttributeError: module 'scipy.stats' has no attribute 'ks_test'. What is the fix?
    import scipy.stats as stats
    result = stats.ks_test(data_train, data_new)
    print(result.pvalue)
Solution
Step 1: Identify the error cause
The error says ks_test does not exist in scipy.stats.
Step 2: Use correct function name
The correct function for the two-sample KS test is ks_2samp, not ks_test.
Final Answer:
Change ks_test to ks_2samp in the code -> Option C
Quick Check:
Function name must be ks_2samp [OK]
- Trying to import non-existent ks_test
- Using one-sample test function by mistake
- Ignoring error message details
Solution
Step 1: Understand monitoring multiple features
Checking each feature for drift helps catch changes in data distribution over time.
Step 2: Use statistical tests and alerts
Applying tests like the KS test periodically and alerting on low p-values ensures timely detection.
Final Answer:
Run a statistical test like the KS test on each feature periodically and trigger alerts if the p-value is below the threshold -> Option D
Quick Check:
Periodic tests + alerts = best drift monitoring [OK]
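A minimal sketch of one such monitoring pass follows. It assumes the caller supplies a p_value_fn, for example lambda a, b: ks_2samp(a, b).pvalue when SciPy is available. The name check_features, the 0.05 threshold, and the toy_p_value placeholder are illustrative assumptions; scheduling the periodic runs is left to whatever job runner the project uses.

```python
def check_features(baseline, current, p_value_fn, alpha=0.05):
    """Return {feature: p_value} for features whose p-value is below alpha."""
    alerts = {}
    for feature, baseline_values in baseline.items():
        p_value = p_value_fn(baseline_values, current[feature])
        if p_value < alpha:
            alerts[feature] = p_value
    return alerts


def toy_p_value(a, b):
    """Placeholder for the demo only: NOT a real statistical test."""
    return 0.01 if sorted(a) != sorted(b) else 1.0


baseline = {"age": [30, 31, 32], "income": [50, 51, 52]}
current = {"age": [30, 31, 32], "income": [70, 71, 72]}
alerts = check_features(baseline, current, toy_p_value)
print(alerts)
```

Passing the test in as a function keeps the loop independent of any one library, so the same pass can run with a KS test or any other two-sample test.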
- Retraining blindly without drift checks
- Ignoring drift and trusting accuracy alone
- Assuming complex models fix drift automatically
