Data drift detection basics in MLOps - Time & Space Complexity

Time Complexity: Data drift detection basics
O(f)
Understanding Time Complexity

When detecting data drift, we want to know how the time to check changes as data grows.

How does the cost of scanning data for drift grow with more data?

Scenario Under Consideration

Analyze the time complexity of the following code snippet.


# Simple data drift detection by comparing feature distributions
for feature in dataset.features:
    baseline_dist = baseline_data[feature].distribution()
    current_dist = current_data[feature].distribution()
    drift_score = calculate_drift(baseline_dist, current_dist)
    if drift_score > threshold:
        alert_drift(feature)

This code checks each feature's distribution in new data against baseline data to find drift.
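
The loop above is pseudocode: `distribution()`, `calculate_drift`, and `alert_drift` are not defined anywhere in this lesson. A minimal runnable sketch of the same shape is shown below. It assumes each feature is a plain list of numbers and uses a standardized shift in the mean as a stand-in drift score. The helper names, the scoring rule, and the threshold of 1.0 are illustrative assumptions, not the lesson's method.

```python
from statistics import mean, pstdev

def calculate_drift(baseline_values, current_values):
    """Hypothetical drift score: shift in the mean, in baseline standard deviations."""
    spread = pstdev(baseline_values) or 1.0  # fall back to 1.0 for constant features
    return abs(mean(current_values) - mean(baseline_values)) / spread

def detect_drift(baseline_data, current_data, threshold=1.0):
    """Return the names of features whose drift score exceeds the threshold."""
    drifted = []
    for feature in baseline_data:  # one drift check per feature
        score = calculate_drift(baseline_data[feature], current_data[feature])
        if score > threshold:
            drifted.append(feature)
    return drifted

# Toy data: "age" is stable, "income" has shifted upward
baseline = {"age": [30, 32, 31, 29, 33], "income": [50, 52, 51, 49, 53]}
current = {"age": [30, 31, 32, 30, 32], "income": [80, 82, 81, 79, 83]}
drifted_features = detect_drift(baseline, current)
print(drifted_features)
```

The outer loop still runs once per feature, which is the part the analysis below focuses on.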

Identify Repeating Operations
  • Primary operation: Looping over each feature in the dataset.
  • How many times: Once per feature, so number of features (f).
How Execution Grows With Input

As the number of features grows, the time to check drift grows linearly.

Input Size (features) | Approx. Operations
10                    | 10 drift checks
100                   | 100 drift checks
1000                  | 1000 drift checks

Pattern observation: Doubling features doubles the work, so growth is steady and linear.
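
The linear pattern can be checked with a small counting sketch. The function below is a hypothetical stand-in that counts one unit of work per feature instead of running a real drift check.

```python
def count_drift_checks(num_features):
    """Count how many per-feature drift checks a single pass performs."""
    checks = 0
    for _ in range(num_features):
        checks += 1  # stands in for one drift check on one feature
    return checks

for f in (10, 100, 1000):
    print(f, count_drift_checks(f))
```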

Final Time Complexity

Time Complexity: O(f)

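This means the time to detect drift grows in direct proportion to the number of features checked, assuming each per-feature check takes roughly constant time. If computing each distribution scans all n rows of a feature, the total cost is closer to O(f × n).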

Common Mistake

[X] Wrong: "Checking more features won't affect the time much because each check is fast."

[OK] Correct: Even if each check is quick, doing many checks adds up, so more features mean more total time.

Interview Connect

Understanding how time grows with data features helps you design scalable monitoring systems in real projects.

Self-Check

"What if we added nested loops to compare every feature pair? How would the time complexity change?"
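
One way to explore this question is to count the comparisons directly. The sketch below assumes "every feature pair" means each unordered pair is compared once. The feature names are made up for illustration.

```python
from itertools import combinations

def count_pairwise_checks(features):
    """Count comparisons when every unordered pair of features is checked once."""
    return sum(1 for _ in combinations(features, 2))

for f in (10, 100, 1000):
    features = [f"feature_{i}" for i in range(f)]
    print(f, count_pairwise_checks(features))
```

Under that assumption the count is f(f - 1)/2, so the work grows quadratically, O(f²), rather than linearly.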

Practice

(1/5)
1. What is the main purpose of data drift detection in machine learning?
easy
A. To check if new data differs significantly from the training data
B. To improve the speed of model training
C. To reduce the size of the training dataset
D. To increase the number of features in the model

Solution

  1. Step 1: Understand data drift concept

    Data drift detection means monitoring whether new incoming data has changed compared to the data used to train the model.
  2. Step 2: Identify the purpose

    This helps ensure the model stays accurate by alerting when data changes too much.
  3. Final Answer:

    To check if new data differs significantly from the training data -> Option A
  4. Quick Check:

    Data drift = detecting data changes [OK]
Hint: Data drift means new data differs from old data [OK]
Common Mistakes:
  • Confusing data drift with model training speed
  • Thinking data drift reduces dataset size
  • Believing data drift adds features
2. Which of the following is a correct Python code snippet to check data drift using the Kolmogorov-Smirnov test on two datasets data_train and data_new?
easy
A. from scipy.stats import ks_test
   result = ks_test(data_train, data_new)
   print(result.pvalue)
B. from scipy.stats import ks_2samp
   result = ks_2samp(data_train, data_new)
   print(result.pvalue)
C. from sklearn.drift import ks_test
   result = ks_test(data_train, data_new)
   print(result.pvalue)
D. import stats
   result = stats.ks_test(data_train, data_new)
   print(result.pvalue)

Solution

  1. Step 1: Identify correct import and function

    The two-sample Kolmogorov-Smirnov test is available in scipy.stats as ks_2samp.
  2. Step 2: Check function usage

    Calling ks_2samp(data_train, data_new) returns a result with pvalue attribute.
  3. Final Answer:

    from scipy.stats import ks_2samp
    result = ks_2samp(data_train, data_new)
    print(result.pvalue)
    -> Option B
  4. Quick Check:

    Correct import and function = scipy.stats.ks_2samp [OK]
Hint: Use scipy.stats.ks_2samp for data drift test [OK]
Common Mistakes:
  • Using wrong module or function name
  • Trying to import non-existent ks_test
  • Confusing sklearn with scipy for this test
3. Given the following Python code to detect data drift, what will be the output if data_train = [1, 2, 3, 4, 5] and data_new = [1, 2, 3, 4, 10]?
from scipy.stats import ks_2samp
result = ks_2samp(data_train, data_new)
print(round(result.pvalue, 2))
medium
A. 0.87
B. 0.05
C. 0.01
D. 1.00

Solution

  1. Step 1: Understand the test and data

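    The Kolmogorov-Smirnov test compares the empirical distributions of the two samples. Here, only one value differs (5 vs 10), so the largest gap between the two empirical CDFs is 0.2.
  2. Step 2: Interpret p-value meaning

    A high p-value (close to 1) means no significant difference; a low p-value means drift is detected. With only five points per sample, 0.2 is the smallest nonzero gap possible, so the p-value is 1.0.
  3. Final Answer:

    1.00 (printed by Python as 1.0) -> Option D
  4. Quick Check:

    Small difference gives high p-value = 1.0 [OK]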
Hint: Small data changes give high p-value (no drift) [OK]
Common Mistakes:
  • Assuming any difference means low p-value
  • Confusing p-value with test statistic
  • Rounding errors in output
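
To see why the p-value is high for these two lists, it can help to compute the KS statistic by hand. The sketch below is a plain-Python illustration of the statistic only: it does not compute a p-value and it is not SciPy's implementation. `ks_statistic` is a hypothetical helper name.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # The gap can only change at observed values, so check each one
    for x in sorted(set(sample_a) | set(sample_b)):
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)  # share of a <= x
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)  # share of b <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

data_train = [1, 2, 3, 4, 5]
data_new = [1, 2, 3, 4, 10]
statistic = ks_statistic(data_train, data_new)
print(round(statistic, 2))  # prints 0.2
```

The two empirical CDFs agree everywhere except between 5 and 10, where they differ by one point out of five. A gap that small on five observations gives the test no evidence that the distributions differ.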
4. You wrote this code to detect data drift but get an error: AttributeError: module 'scipy.stats' has no attribute 'ks_test'. What is the fix?
import scipy.stats as stats
result = stats.ks_test(data_train, data_new)
print(result.pvalue)
medium
A. Use stats.kstest instead of ks_test
B. Import ks_test from scipy.stats explicitly
C. Change ks_test to ks_2samp in the code
D. Update scipy package to latest version

Solution

  1. Step 1: Identify the error cause

    The error says ks_test does not exist in scipy.stats.
  2. Step 2: Use correct function name

    The correct function for two-sample KS test is ks_2samp, not ks_test.
  3. Final Answer:

    Change ks_test to ks_2samp in the code -> Option C
  4. Quick Check:

    Function name must be ks_2samp [OK]
Hint: Use ks_2samp, not ks_test, for two-sample KS test [OK]
Common Mistakes:
  • Trying to import non-existent ks_test
  • Using one-sample test function by mistake
  • Ignoring error message details
5. You want to monitor data drift for multiple features in your dataset. Which approach best helps detect drift over time and alert you when it happens?
hard
A. Ignore data drift and focus on model accuracy metrics only
B. Retrain the model daily without checking data changes
C. Increase the model complexity to handle any data changes automatically
D. Run a statistical test like KS test on each feature periodically and trigger alerts if p-value is below threshold

Solution

  1. Step 1: Understand monitoring multiple features

    Checking each feature for drift helps catch changes in data distribution over time.
  2. Step 2: Use statistical tests and alerts

    Applying tests like KS test periodically and alerting on low p-values ensures timely detection.
  3. Final Answer:

    Run a statistical test like KS test on each feature periodically and trigger alerts if p-value is below threshold -> Option D
  4. Quick Check:

    Periodic tests + alerts = best drift monitoring [OK]
Hint: Test features regularly and alert on low p-values [OK]
Common Mistakes:
  • Retraining blindly without drift checks
  • Ignoring drift and trusting accuracy alone
  • Assuming complex models fix drift automatically
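
The periodic approach from Option D can be sketched as a loop over monitoring periods and features. This is a minimal illustration, not a production monitor: `monitor_batches`, the dictionary layout, and the `alpha` default of 0.05 are assumptions, and the statistical test is passed in as a function so the sketch runs without extra libraries.

```python
def monitor_batches(baseline, batches, p_value_fn, alpha=0.05):
    """Check every feature of every batch against the baseline.

    baseline: dict mapping feature name -> list of values
    batches: iterable of dicts with the same keys, one per monitoring period
    p_value_fn: callable (baseline_values, current_values) -> p-value
    Returns a list of (batch_index, feature_name, p_value) alerts.
    """
    alerts = []
    for batch_index, batch in enumerate(batches):
        for feature, baseline_values in baseline.items():
            p_value = p_value_fn(baseline_values, batch[feature])
            if p_value < alpha:  # low p-value: distributions likely differ
                alerts.append((batch_index, feature, p_value))
    return alerts

# Toy stand-in for a real statistical test, for demonstration only
def toy_p_value(baseline_values, current_values):
    return 1.0 if sorted(baseline_values) == sorted(current_values) else 0.01

baseline = {"age": [1, 2, 3], "income": [10, 20, 30]}
batches = [
    {"age": [3, 2, 1], "income": [10, 20, 30]},      # period 0: no change
    {"age": [1, 2, 3], "income": [100, 200, 300]},   # period 1: income shifted
]
alerts = monitor_batches(baseline, batches, toy_p_value)
print(alerts)
```

With SciPy installed, `p_value_fn` could be `lambda a, b: ks_2samp(a, b).pvalue`, which matches the test used in the earlier questions.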