Data drift detection in MLOps - Time & Space Complexity
When detecting data drift, we want to know how the time to check changes as data grows.
We ask: How does the work increase when more data points arrive?
Analyze the time complexity of the following code snippet.
# Assume we have a batch of new data samples
new_data = load_new_data()
# Reference data summary stored
ref_summary = load_reference_summary()
# For each feature, compare distributions
for feature in new_data.features:
    new_dist = calculate_distribution(new_data[feature])
    drift_score = compare_distributions(new_dist, ref_summary[feature])
    if drift_score > threshold:
        alert_drift(feature)
This code checks each feature's data distribution against a stored reference to find if data drift happened.
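The snippet above leaves its helper functions undefined. As a minimal runnable sketch, the version below fills them in with one possible choice: a fixed-bin histogram as the distribution and the Population Stability Index (PSI) as the drift score. The function names, the bin edges, the synthetic data, and the 0.2 threshold (a common rule of thumb) are all assumptions made for illustration, not part of the original snippet.

```python
import math
import random

def histogram(values, edges):
    """Return the fraction of values falling into each bin defined by edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        # Walk up the bins; out-of-range values land in the first or last bin.
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    return [c / len(values) for c in counts]

def psi(ref_fracs, new_fracs, eps=1e-6):
    """Population Stability Index between two binned distributions."""
    score = 0.0
    for r, n in zip(ref_fracs, new_fracs):
        r, n = max(r, eps), max(n, eps)  # avoid log(0) for empty bins
        score += (n - r) * math.log(n / r)
    return score

def detect_drift(new_data, ref_summary, edges, threshold=0.2):
    """Return the names of features whose PSI exceeds the threshold."""
    drifted = []
    for feature, values in new_data.items():   # one iteration per feature
        new_fracs = histogram(values, edges)   # reads every sample once
        if psi(ref_summary[feature], new_fracs) > threshold:
            drifted.append(feature)
    return drifted

# Synthetic example: "income" is shifted, "age" is not.
rng = random.Random(0)
edges = [-4, -2, -1, 0, 1, 2, 4]
reference = {
    "age": [rng.gauss(0, 1) for _ in range(5000)],
    "income": [rng.gauss(0, 1) for _ in range(5000)],
}
ref_summary = {name: histogram(vals, edges) for name, vals in reference.items()}
new_data = {
    "age": [rng.gauss(0, 1) for _ in range(5000)],
    "income": [rng.gauss(1.5, 1) for _ in range(5000)],
}
drifted = detect_drift(new_data, ref_summary, edges)
print(drifted)
```

Storing only the binned fractions as the reference summary means the reference data does not need to be re-read on every check, which is why the loop cost depends on the new batch alone.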
Identify the loops, recursion, and array traversals that repeat.
- Primary operation: Loop over each feature to calculate and compare distributions.
- How many times: Once per feature in the dataset.
As the number of features grows, the time to check drift grows linearly.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 features | 10 distribution comparisons |
| 100 features | 100 distribution comparisons |
| 1000 features | 1000 distribution comparisons |
Pattern observation: Doubling features roughly doubles the work.
Time Complexity: O(n)
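This means the time to detect drift grows in direct proportion to the number of features checked, assuming the work done per feature stays fixed. If each batch holds m samples per feature and building a distribution reads every sample, the total work is O(n × m), so it also grows as more data points arrive.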
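The linear growth can be checked by counting operations instead of timing them. The sketch below is illustrative only: `count_work` is a hypothetical helper, and it assumes that building one feature's distribution reads each sample in the batch exactly once.

```python
def count_work(num_features, samples_per_feature):
    """Count comparisons and sample reads for one drift check."""
    comparisons = 0
    sample_reads = 0
    for _ in range(num_features):
        # Building the distribution reads every sample of this feature once.
        sample_reads += samples_per_feature
        # One comparison against the stored reference summary.
        comparisons += 1
    return comparisons, sample_reads

# Same batch size, growing feature count.
results = {n: count_work(n, 1000) for n in (10, 100, 1000)}
for n, (comparisons, reads) in results.items():
    print(n, comparisons, reads)
```

With the batch size held at 1000, both counters grow tenfold each time the feature count grows tenfold.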
[X] Wrong: "Checking data drift takes the same time no matter how many features there are."
[OK] Correct: Each feature requires its own comparison, so more features mean more work.
Understanding how data drift detection scales helps you design efficient monitoring systems in real projects.
"What if we compared only a random sample of features instead of all? How would the time complexity change?"
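One way to explore this question is to draw the sample explicitly. The sketch below uses a hypothetical helper, `sample_features`, and made-up feature names. If k features are checked instead of all n, the loop runs k times; when k is a fixed number the work no longer grows with n, and when k is a fixed fraction of n the work is still linear in n with a smaller constant.

```python
import random

def sample_features(feature_names, k, seed=None):
    """Pick up to k distinct feature names uniformly at random."""
    rng = random.Random(seed)
    k = min(k, len(feature_names))  # cannot sample more than exist
    return rng.sample(feature_names, k)

features = [f"feature_{i}" for i in range(1000)]
subset = sample_features(features, k=50, seed=42)

# The drift loop would now iterate over `subset` only: 50 checks, not 1000.
print(len(subset))
```

The trade-off is coverage: drift in a feature that was not sampled goes unnoticed until a later run happens to pick it.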
Practice
data drift detection in MLOps?
Solution
Step 1: Understand data drift concept
Data drift means the new data has changed compared with the data used to train the model.
Step 2: Identify the purpose of detection
Detecting data drift helps decide when to retrain or update the model to keep it accurate.
Final Answer:
To check if new data differs significantly from the training data -> Option B
Quick Check:
Data drift detection = check data difference [OK]
- Confusing data drift with model speed optimization
- Thinking data drift reduces dataset size
- Assuming data drift adds features
Solution
Step 1: Recall common MLOps tools
Evidently is a popular tool designed specifically for monitoring data and model drift.
Step 2: Differentiate from other libraries
NumPy is for math, Matplotlib for plotting, Flask for web apps, not for drift detection.
Final Answer:
Evidently -> Option D
Quick Check:
Evidently = data drift detection tool [OK]
- Choosing NumPy or Matplotlib which are not for drift detection
- Confusing Flask as a data tool
report.run(reference_data, current_data) do?
Solution
Step 1: Understand Evidently report usage
The run method compares new data (current_data) against reference data to find differences.
Step 2: Identify the purpose of the method
It does not train models, visualize architecture, or delete data; it detects data drift.
Final Answer:
Compare current_data with reference_data to detect data drift -> Option C
Quick Check:
report.run compares data for drift [OK]
- Thinking it trains a model
- Assuming it visualizes model structure
- Believing it deletes data
from evidently.dashboard import Dashboard
dashboard = Dashboard(tabs=["data_drift"])
dashboard.run(current_data)
What is the likely mistake?
Solution
Step 1: Check Dashboard.run() method requirements
Dashboard.run() requires both reference and current datasets to compare for drift.
Step 2: Identify missing argument
Only current_data is passed; reference_data is missing, causing the error.
Final Answer:
Missing reference data argument in dashboard.run() -> Option A
Quick Check:
Dashboard.run needs reference and current data [OK]
- Assuming import is wrong
- Thinking data_drift tab is unsupported
- Believing variable name causes error
Solution
Step 1: Understand automation in MLOps
Automating retraining based on data drift ensures the model stays accurate without manual checks.
Step 2: Identify best practice
Running daily drift detection and triggering retraining only when drift occurs is efficient and effective.
Final Answer:
Set up a monitoring pipeline that runs data drift detection daily and triggers retraining if drift is found -> Option A
Quick Check:
Automate retrain on drift detection = best practice [OK]
- Retraining blindly without checking data
- Relying on manual checks only
- Ignoring drift until accuracy drops
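The pattern in the last solution, check daily and retrain only on drift, can be sketched as a small function. Everything here is hypothetical: `mean_shift_detector` is a deliberately simple stand-in for a real drift test, `retrain` is a stub that only records what it was asked to do, and the scheduling itself (cron, an orchestrator, and so on) is left out.

```python
from statistics import mean, stdev

def mean_shift_detector(reference, current, z_threshold=3.0):
    """Flag features whose current mean is far from the reference mean."""
    drifted = []
    for name, ref_values in reference.items():
        cur_values = current[name]
        # Standard error of the current batch mean, using the reference spread.
        std_err = stdev(ref_values) / len(cur_values) ** 0.5
        shift = abs(mean(cur_values) - mean(ref_values))
        if std_err > 0 and shift / std_err > z_threshold:
            drifted.append(name)
    return drifted

def run_daily_check(detect, retrain, reference, current):
    """Run one drift check and trigger retraining only if drift is found."""
    drifted = detect(reference, current)
    if drifted:
        retrain(drifted)
        return True
    return False

retrain_log = []  # stub: records which features triggered retraining

reference = {"latency": [10, 11, 9, 10, 12, 10, 9, 11]}
daily_batches = [
    {"latency": [10, 11, 10, 9]},    # day 1: similar to the reference
    {"latency": [15, 16, 14, 15]},   # day 2: clearly shifted
]
results = [
    run_daily_check(mean_shift_detector, retrain_log.append, reference, batch)
    for batch in daily_batches
]
print(results, retrain_log)
```

Passing the detector and the retraining step in as arguments keeps the check itself independent of any particular drift library or training job.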
