
Data drift detection basics in MLOps - Step-by-Step Execution

Process Flow - Data drift detection basics
Collect baseline data
Train model on baseline
Collect new incoming data
Compare new data to baseline
Calculate drift metrics
Is drift above threshold?
No: Continue monitoring
Yes: Trigger alert or retrain model
This flow shows how data drift detection compares new data to baseline data, calculates metrics, and triggers alerts if drift is detected.
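The flow above repeats for every batch of incoming data. A minimal sketch of that loop follows; the function names `mean_abs_diff` and `monitor`, the action strings, and the second batch of values are assumptions made for illustration, and the metric assumes equal-length, paired samples.

```python
def mean_abs_diff(baseline, batch):
    """Mean absolute difference between paired values (the example metric)."""
    return sum(abs(n - b) for n, b in zip(batch, baseline)) / len(baseline)


def monitor(baseline, batches, threshold):
    """Walk the flow once per incoming batch and record each decision."""
    decisions = []
    for i, batch in enumerate(batches):
        drift = mean_abs_diff(baseline, batch)
        # Decision node: above threshold -> alert/retrain, else keep monitoring.
        action = "alert_or_retrain" if drift > threshold else "continue_monitoring"
        decisions.append((i, drift, action))
    return decisions


baseline = [10, 12, 11, 13, 12]
batches = [
    [10, 15, 11, 14, 20],  # mild change
    [18, 20, 19, 21, 22],  # large shift (hypothetical values)
]
decisions = monitor(baseline, batches, threshold=3)
for step in decisions:
    print(step)
```

The first batch stays below the threshold and monitoring continues; the second exceeds it and would trigger the alert branch.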
Execution Sample
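# Paired samples: each new value is compared with the baseline value at the same position.
baseline_data = [10, 12, 11, 13, 12]
new_data = [10, 15, 11, 14, 20]
# Absolute difference for each pair of values.
differences = [abs(n - b) for n, b in zip(new_data, baseline_data)]
# Drift metric: mean absolute difference.
drift = sum(differences) / len(baseline_data)
threshold = 3
# Alert only when drift exceeds the threshold.
alert = drift > threshold
print(alert)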
This code calculates the average absolute difference between paired baseline and new values as a simple drift metric, then prints whether an alert is triggered.
Process Table
| Step | Action | Calculation | Value | Result |
| --- | --- | --- | --- | --- |
| 1 | Calculate absolute differences | abs(10-10), abs(15-12), abs(11-11), abs(14-13), abs(20-12) | [0, 3, 0, 1, 8] | List of differences |
| 2 | Sum differences | 0 + 3 + 0 + 1 + 8 | 12 | Total difference |
| 3 | Calculate average difference | 12 / 5 | 2.4 | Drift metric |
| 4 | Compare drift to threshold | 2.4 > 3 | False | No alert triggered |
💡 Drift 2.4 is less than threshold 3, so no alert is triggered.
Status Tracker
| Variable | Start | After Step 1 | After Step 2 | After Step 3 | After Step 4 |
| --- | --- | --- | --- | --- | --- |
| baseline_data | [10,12,11,13,12] | [10,12,11,13,12] | [10,12,11,13,12] | [10,12,11,13,12] | [10,12,11,13,12] |
| new_data | [10,15,11,14,20] | [10,15,11,14,20] | [10,15,11,14,20] | [10,15,11,14,20] | [10,15,11,14,20] |
| differences | N/A | [0,3,0,1,8] | [0,3,0,1,8] | [0,3,0,1,8] | [0,3,0,1,8] |
| total_difference | N/A | N/A | 12 | 12 | 12 |
| drift | N/A | N/A | N/A | 2.4 | 2.4 |
| threshold | 3 | 3 | 3 | 3 | 3 |
| alert | N/A | N/A | N/A | N/A | False |
Key Moments - 3 Insights
Why do we calculate the average difference instead of just the sum?
The average difference normalizes the drift metric by the number of data points, making it comparable across datasets of different sizes, as shown in step 3 of the Process Table.
What does it mean if the alert is False even though differences exist?
It means the average drift is not large enough to exceed the threshold, so the system considers the data stable, as seen in step 4 where 2.4 is less than 3.
Why do we compare new data to baseline data?
Baseline data represents the original data distribution the model was trained on; comparing new data to it helps detect changes or drift, as shown in step 1 where differences are calculated.
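The first insight above can be illustrated with a small sketch. The helper names `total_abs_diff` and `mean_abs_diff` and the constant datasets are assumptions made for the example: every point shifts by exactly 1 in both a small and a large dataset, yet only the averaged metric reports the same drift for both.

```python
def total_abs_diff(baseline, new):
    """Sum of absolute differences between paired values."""
    return sum(abs(n - b) for n, b in zip(new, baseline))


def mean_abs_diff(baseline, new):
    """Sum of absolute differences divided by the number of data points."""
    return total_abs_diff(baseline, new) / len(baseline)


# Every point shifts by exactly 1 in both datasets.
small_base, small_new = [10] * 5, [11] * 5
large_base, large_new = [10] * 500, [11] * 500

# The sum grows with dataset size even though the per-point change is identical.
print(total_abs_diff(small_base, small_new), total_abs_diff(large_base, large_new))
# The average is the same for both, so one threshold can serve both datasets.
print(mean_abs_diff(small_base, small_new), mean_abs_diff(large_base, large_new))
```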
Visual Quiz - 3 Questions
Test your understanding
Look at the Process Table at step 3. What is the drift value calculated?
A. 3
B. 12
C. 2.4
D. False
💡 Hint
Check the 'Value' column at step 3 in the Process Table.
At which step does the system decide if an alert should be triggered?
A. Step 2
B. Step 4
C. Step 1
D. Step 3
💡 Hint
Look for the step where drift is compared to threshold in the Process Table.
If the threshold were lowered to 2, what would the alert value be at step 4?
A. True
B. False
C. 12
D. 2.4
💡 Hint
Compare drift 2.4 to the new threshold 2 in step 4 of the Process Table.
Concept Snapshot
Data drift detection compares new data to baseline data.
Calculate a drift metric (e.g., average absolute difference).
Set a threshold to decide if drift is significant.
If drift > threshold, trigger alert or retrain.
This helps keep ML models accurate over time.
Full Transcript
Data drift detection basics involve comparing new incoming data to the original baseline data used to train a model. We calculate a drift metric, such as the average absolute difference between the new and baseline data points. This metric is then compared to a set threshold. If the drift exceeds the threshold, it indicates that the data distribution has changed significantly, and an alert is triggered to notify that the model may need retraining. This process helps maintain model accuracy by detecting when the data environment changes.

Practice

1. What is the main purpose of data drift detection in machine learning?
easy
A. To check if new data differs significantly from the training data
B. To improve the speed of model training
C. To reduce the size of the training dataset
D. To increase the number of features in the model

Solution

  1. Step 1: Understand data drift concept

    Data drift detection is about monitoring if new incoming data changes compared to the data used to train the model.
  2. Step 2: Identify the purpose

    This helps ensure the model stays accurate by alerting when data changes too much.
  3. Final Answer:

    To check if new data differs significantly from the training data -> Option A
  4. Quick Check:

    Data drift = detecting data changes [OK]
Hint: Data drift means new data differs from old data [OK]
Common Mistakes:
  • Confusing data drift with model training speed
  • Thinking data drift reduces dataset size
  • Believing data drift adds features
2. Which of the following is a correct Python code snippet to check data drift using the Kolmogorov-Smirnov test on two datasets data_train and data_new?
easy
A. from scipy.stats import ks_test; result = ks_test(data_train, data_new); print(result.pvalue)
B. from scipy.stats import ks_2samp; result = ks_2samp(data_train, data_new); print(result.pvalue)
C. from sklearn.drift import ks_test; result = ks_test(data_train, data_new); print(result.pvalue)
D. import stats; result = stats.ks_test(data_train, data_new); print(result.pvalue)

Solution

  1. Step 1: Identify correct import and function

    The Kolmogorov-Smirnov test is in scipy.stats as ks_2samp.
  2. Step 2: Check function usage

    Calling ks_2samp(data_train, data_new) returns a result with a pvalue attribute.
  3. Final Answer:

    from scipy.stats import ks_2samp; result = ks_2samp(data_train, data_new); print(result.pvalue) -> Option B
  4. Quick Check:

    Correct import and function = scipy.stats.ks_2samp [OK]
Hint: Use scipy.stats.ks_2samp for data drift test [OK]
Common Mistakes:
  • Using wrong module or function name
  • Trying to import non-existent ks_test
  • Confusing sklearn with scipy for this test
3. Given the following Python code to detect data drift, what will be the output if data_train = [1, 2, 3, 4, 5] and data_new = [1, 2, 3, 4, 10]?
from scipy.stats import ks_2samp
result = ks_2samp(data_train, data_new)
print(round(result.pvalue, 2))
medium
A. 0.87
B. 0.05
C. 0.01
D. 1.00

Solution

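  1. Step 1: Understand the test and data

    The Kolmogorov-Smirnov test compares the empirical distributions of the two samples. Here only one value differs (5 vs 10), so the largest gap between the two empirical CDFs is 0.2 (one point out of five).
  2. Step 2: Interpret p-value meaning

    A high p-value (close to 1) means no significant difference; a low p-value means drift is detected. With five points per sample, 0.2 is the smallest nonzero gap possible, so with SciPy's default exact method for small samples the p-value is 1.0.
  3. Final Answer:

    1.00 -> Option D
  4. Quick Check:

    Small difference gives high p-value = 1.00 [OK]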
Hint: Small data changes give high p-value (no drift) [OK]
Common Mistakes:
  • Assuming any difference means low p-value
  • Confusing p-value with test statistic
  • Rounding errors in output
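The statistic behind the two-sample Kolmogorov-Smirnov test is the largest gap between the two empirical CDFs. The sketch below computes only that statistic in plain Python, not the p-value; the helper name `ks_statistic` is hypothetical and is not part of SciPy.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


data_train = [1, 2, 3, 4, 5]
data_new = [1, 2, 3, 4, 10]
print(round(ks_statistic(data_train, data_new), 2))
```

For this data the two CDFs differ only between 5 and 10, where one sample has reached 5/5 and the other is still at 4/5.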
4. You wrote this code to detect data drift but get an error: AttributeError: module 'scipy.stats' has no attribute 'ks_test'. What is the fix?
import scipy.stats as stats
result = stats.ks_test(data_train, data_new)
print(result.pvalue)
medium
A. Use stats.kstest instead of ks_test
B. Import ks_test from scipy.stats explicitly
C. Change ks_test to ks_2samp in the code
D. Update scipy package to latest version

Solution

  1. Step 1: Identify the error cause

    The error says ks_test does not exist in scipy.stats.
  2. Step 2: Use correct function name

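    The dedicated function for the two-sample KS test is ks_2samp, not ks_test. stats.kstest also exists, but it is primarily the one-sample goodness-of-fit test, so ks_2samp is the direct choice here.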
  3. Final Answer:

    Change ks_test to ks_2samp in the code -> Option C
  4. Quick Check:

    Function name must be ks_2samp [OK]
Hint: Use ks_2samp, not ks_test, for two-sample KS test [OK]
Common Mistakes:
  • Trying to import non-existent ks_test
  • Using one-sample test function by mistake
  • Ignoring error message details
5. You want to monitor data drift for multiple features in your dataset. Which approach best helps detect drift over time and alert you when it happens?
hard
A. Ignore data drift and focus on model accuracy metrics only
B. Retrain the model daily without checking data changes
C. Increase the model complexity to handle any data changes automatically
D. Run a statistical test like KS test on each feature periodically and trigger alerts if p-value is below threshold

Solution

  1. Step 1: Understand monitoring multiple features

    Checking each feature for drift helps catch changes in data distribution over time.
  2. Step 2: Use statistical tests and alerts

    Applying tests like KS test periodically and alerting on low p-values ensures timely detection.
  3. Final Answer:

    Run a statistical test like KS test on each feature periodically and trigger alerts if p-value is below threshold -> Option D
  4. Quick Check:

    Periodic tests + alerts = best drift monitoring [OK]
Hint: Test features regularly and alert on low p-values [OK]
Common Mistakes:
  • Retraining blindly without drift checks
  • Ignoring drift and trusting accuracy alone
  • Assuming complex models fix drift automatically
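The per-feature approach in Option D can be sketched as follows. This is an illustration under stated assumptions, not a production monitor: the helper names, the feature names, and the data are hypothetical, and the p-value uses the standard large-sample (asymptotic) Kolmogorov series so that the example runs with the standard library alone. In practice one would likely call scipy.stats.ks_2samp for each feature, and scheduling the check periodically is left to whatever job runner is in use.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def ks_pvalue_asymptotic(d, n, m, terms=100):
    """Large-sample approximation of the two-sided KS p-value."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:  # the series converges slowly here; p is essentially 1
        return 1.0
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


def drifted_features(baseline, current, alpha=0.05):
    """Return {feature: p_value} for features whose p-value is below alpha."""
    alerts = {}
    for name, base_values in baseline.items():
        new_values = current[name]
        d = ks_statistic(base_values, new_values)
        p = ks_pvalue_asymptotic(d, len(base_values), len(new_values))
        if p < alpha:
            alerts[name] = p
    return alerts


# Hypothetical feature columns: "age" is unchanged, "income" has shifted upward.
baseline = {
    "age": list(range(20, 70)),
    "income": list(range(100, 150)),
}
current = {
    "age": list(range(20, 70)),
    "income": list(range(130, 180)),
}
alerts = drifted_features(baseline, current)
print(sorted(alerts))
```

Only the shifted feature is flagged. When many features are tested at once, some low p-values will appear by chance, so the alert threshold may need to be stricter than it would be for a single feature.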