Data drift detection basics in MLOps - Time & Space Complexity
When detecting data drift, we want to know how the time to run the check grows as our monitoring workload grows.
Specifically: how does the cost of scanning for drift grow as we monitor more features?
Analyze the time complexity of the following code snippet.
```python
# Simple data drift detection by comparing feature distributions
for feature in dataset.features:
    baseline_dist = baseline_data[feature].distribution()
    current_dist = current_data[feature].distribution()
    drift_score = calculate_drift(baseline_dist, current_dist)
    if drift_score > threshold:
        alert_drift(feature)
```
This code checks each feature's distribution in new data against baseline data to find drift.
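To make the loop concrete, here is a minimal runnable sketch. The `dataset`, `distribution()`, `calculate_drift`, and `alert_drift` names in the snippet above are illustrative, so this sketch substitutes a toy drift score (absolute difference of feature means) and plain dictionaries of values; real systems would use a statistical test instead.

```python
def calculate_drift(baseline, current):
    """Toy drift score: absolute difference of the two sample means."""
    mean = lambda xs: sum(xs) / len(xs)
    return abs(mean(baseline) - mean(current))

# Hypothetical feature data: "age" has shifted, "income" has not.
baseline_data = {"age": [30, 32, 31, 29], "income": [50.0, 52.0, 49.0, 51.0]}
current_data = {"age": [45, 47, 44, 46], "income": [50.5, 51.0, 49.5, 50.0]}
threshold = 5.0

drifted = []
for feature in baseline_data:  # one drift check per feature -> O(f)
    score = calculate_drift(baseline_data[feature], current_data[feature])
    if score > threshold:
        drifted.append(feature)

print(drifted)  # ['age'] -- only age's mean moved by more than the threshold
```

The key structural point survives the simplification: whatever `calculate_drift` does internally, the outer loop runs exactly once per feature.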
- Primary operation: Looping over each feature in the dataset.
- How many times: Once per feature, for a total of f iterations, where f is the number of features.
As the number of features grows, the time to check drift grows linearly.
| Input Size (features) | Approx. Operations |
|---|---|
| 10 | 10 drift checks |
| 100 | 100 drift checks |
| 1000 | 1000 drift checks |
Pattern observation: Doubling features doubles the work, so growth is steady and linear.
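You can verify the pattern in the table by counting operations directly. This is a trivial sketch, with `count_checks` standing in for a real monitoring run: it counts one unit of work per feature, so doubling the input doubles the count.

```python
def count_checks(num_features):
    """Count drift checks performed for a given number of features."""
    checks = 0
    for _ in range(num_features):  # one drift check per feature
        checks += 1
    return checks

for f in (10, 100, 1000):
    print(f, count_checks(f))  # operations grow in lockstep with f
```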
Time Complexity: O(f)
This means the time to detect drift grows directly with the number of features checked.
[X] Wrong: "Checking more features won't affect the time much because each check is fast."
[OK] Correct: Even if each check is quick, doing many checks adds up, so more features mean more total time.
Understanding how time grows with data features helps you design scalable monitoring systems in real projects.
"What if we added nested loops to compare every feature pair? How would the time complexity change?"