Data drift detection in MLOps - Time & Space Complexity
When detecting data drift, we want to know how the time to run a check grows as the data grows.
The question: how much more work is needed as more features and data points arrive?
Analyze the time complexity of the following code snippet.
```python
# Assume we have a batch of new data samples
new_data = load_new_data()
# Per-feature reference summary, computed offline and stored
ref_summary = load_reference_summary()

# For each feature, compare the new distribution against the reference
for feature in new_data.features:
    new_dist = calculate_distribution(new_data[feature])
    drift_score = compare_distributions(new_dist, ref_summary[feature])
    if drift_score > threshold:
        alert_drift(feature)
```
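To make the loop above concrete, here is a minimal runnable sketch using only the standard library. The helper names (`calculate_distribution`, `compare_distributions`) and the choice of summary are assumptions: a "distribution" is reduced to a `(mean, stdev)` pair, and the drift score is the mean shift measured in reference standard deviations. Real monitoring systems typically use richer statistics (e.g. histograms with a KS or PSI comparison).

```python
import random
import statistics

def calculate_distribution(values):
    # Toy summary of a feature's distribution: (mean, stdev).
    return (statistics.mean(values), statistics.stdev(values))

def compare_distributions(new_dist, ref_dist):
    # Drift score: how many reference stdevs the mean has shifted.
    new_mean, _ = new_dist
    ref_mean, ref_std = ref_dist
    return abs(new_mean - ref_mean) / ref_std if ref_std else 0.0

random.seed(0)
# Reference summary: per-feature (mean, stdev), computed offline.
ref_summary = {"f1": (0.0, 1.0), "f2": (5.0, 2.0)}
# New batch: "f2" is deliberately shifted to simulate drift.
new_data = {
    "f1": [random.gauss(0.0, 1.0) for _ in range(500)],
    "f2": [random.gauss(9.0, 2.0) for _ in range(500)],
}

threshold = 1.0
drifted = []
for feature, values in new_data.items():  # one pass per feature -> O(n)
    new_dist = calculate_distribution(values)
    drift_score = compare_distributions(new_dist, ref_summary[feature])
    if drift_score > threshold:
        drifted.append(feature)

print(drifted)  # "f2" shifted by ~2 reference stdevs, so it is flagged
```

Note that the body of the loop does a fixed amount of work per feature, which is what makes the overall cost proportional to the feature count.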
This code compares each feature's distribution in the new batch against a stored reference to determine whether drift has occurred.
Identify the operations that repeat: loops, recursion, or array traversals.
- Primary operation: Loop over each feature to calculate and compare distributions.
- How many times: Once per feature in the dataset.
As the number of features grows, the time to check drift grows linearly.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 features | 10 distribution comparisons |
| 100 features | 100 distribution comparisons |
| 1000 features | 1000 distribution comparisons |
Pattern observation: Doubling features roughly doubles the work.
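The table's pattern can be verified directly by counting comparisons. This small sketch (the counter function is illustrative, not part of the original snippet) shows the operation count growing in lockstep with the feature count:

```python
def drift_check_ops(n_features):
    # Count one distribution comparison per feature, mirroring the loop above.
    ops = 0
    for _ in range(n_features):
        ops += 1
    return ops

for n in (10, 100, 1000):
    print(n, drift_check_ops(n))
# Doubling n doubles the work: drift_check_ops(2 * n) == 2 * drift_check_ops(n)
```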
Time Complexity: O(n), where n is the number of features checked.
Space Complexity: also O(n), since one reference summary is stored per feature; each individual comparison needs only constant extra space.
This means the time to detect drift grows in direct proportion to the number of features.
[X] Wrong: "Checking data drift takes the same time no matter how many features there are."
[OK] Correct: Each feature requires its own comparison, so more features mean more work.
Understanding how data drift detection scales helps you design efficient monitoring systems in real projects.
"What if we compared only a random sample of features instead of all? How would the time complexity change?"
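One way to explore that question: if each run checks a fixed-size random subset of k features instead of all n, the per-run cost drops to O(k). The sketch below (a design assumption, not part of the original snippet) shows the sampling step; the trade-off is that a drifting feature may be missed in any single run, though repeated runs will eventually catch it.

```python
import random

def sample_features(all_features, k, seed=None):
    # Pick k features uniformly at random (or all of them if k >= n).
    rng = random.Random(seed)
    return rng.sample(all_features, min(k, len(all_features)))

features = [f"f{i}" for i in range(1000)]
subset = sample_features(features, k=50, seed=42)
print(len(subset))  # 50 comparisons per run instead of 1000
```

With k held constant, each monitoring run is effectively O(1) in the feature count, at the cost of detection latency for unsampled features.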