Bird
Raised Fist0
MLOpsdevops~10 mins

Data drift detection in MLOps - Step-by-Step Execution

Choose your learning style10 modes available

Start learning this pattern below

Jump into concepts and practice - no test required

or
Recommended
Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong
Process Flow - Data drift detection
Collect baseline data
Train model on baseline
Collect new incoming data
Compare new data to baseline
Calculate drift metrics
Is drift above threshold?
NoContinue monitoring
Yes
Trigger alert or retrain model
This flow shows how data drift detection compares new data to baseline data, calculates drift, and triggers alerts if drift is significant.
Execution Sample
MLOps
baseline = load_data('baseline.csv')
new_data = load_data('new.csv')
drift_score = calculate_drift(baseline, new_data)
if drift_score > 0.3:
    alert('Data drift detected')
This code loads baseline and new data, calculates a drift score, and alerts if the score exceeds 0.3.
Process Table
StepActionData ComparedDrift ScoreDecisionOutput
1Load baseline databaseline.csv-N/ABaseline data loaded
2Load new datanew.csv-N/ANew data loaded
3Calculate drift scorebaseline vs new0.250.25 <= 0.3No alert
4Calculate drift scorebaseline vs new0.350.35 > 0.3Alert triggered: Data drift detected
💡 Execution stops after alert triggered or no alert if drift score is below threshold
Status Tracker
VariableStartAfter Step 1After Step 2After Step 3After Step 4
baselineNoneLoaded baseline.csv dataLoaded baseline.csv dataLoaded baseline.csv dataLoaded baseline.csv data
new_dataNoneNoneLoaded new.csv dataLoaded new.csv dataLoaded new.csv data
drift_scoreNoneNoneNone0.25 or 0.35 depending on data0.25 or 0.35 depending on data
Key Moments - 2 Insights
Why do we compare new data to baseline data instead of just looking at new data alone?
Because drift means the new data distribution changes compared to the baseline. The execution_table shows step 3 where the comparison happens to calculate drift_score.
What does the drift_score represent and why is there a threshold?
Drift_score quantifies how much new data differs from baseline. The threshold (0.3) in step 4 decides if the difference is significant enough to alert.
Visual Quiz - 3 Questions
Test your understanding
Look at the execution_table, what is the drift_score at step 3?
A0.3
B0.25
C0.35
DNo score calculated
💡 Hint
Check the 'Drift Score' column in row with Step 3 in execution_table
At which step does the system decide to trigger an alert?
AStep 2
BStep 3
CStep 4
DNo alert triggered
💡 Hint
Look at the 'Decision' and 'Output' columns in execution_table rows
If the threshold was changed to 0.2, how would the alert behavior change?
AAlert triggers at drift_score 0.25
BAlert triggers only at drift_score 0.35
CNo alerts would trigger
DAlerts trigger at any drift_score
💡 Hint
Compare threshold logic in code and execution_table decisions
Concept Snapshot
Data drift detection compares new data to baseline data.
Calculate a drift score to measure difference.
If drift score > threshold, trigger alert or retrain.
Common threshold example: 0.3.
Helps keep ML models accurate over time.
Full Transcript
Data drift detection is a process where we first collect baseline data and train a model on it. Then, as new data comes in, we compare it to the baseline data to see if the data distribution has changed. We calculate a drift score to quantify this difference. If the drift score is above a set threshold, for example 0.3, we trigger an alert to notify that data drift has occurred. This helps us know when to retrain or adjust our machine learning models to keep them accurate. The execution steps include loading baseline and new data, calculating the drift score, and deciding whether to alert based on the threshold.

Practice

(1/5)
1. What is the main purpose of data drift detection in MLOps?
easy
A. To reduce the size of the dataset
B. To check if new data differs significantly from the training data
C. To improve the speed of model training
D. To increase the number of features in the model

Solution

  1. Step 1: Understand data drift concept

    Data drift means the new data changes compared to the data used to train the model.
  2. Step 2: Identify the purpose of detection

    Detecting data drift helps decide when to retrain or update the model to keep it accurate.
  3. Final Answer:

    To check if new data differs significantly from the training data -> Option B
  4. Quick Check:

    Data drift detection = check data difference [OK]
Hint: Data drift means new data changes from old data [OK]
Common Mistakes:
  • Confusing data drift with model speed optimization
  • Thinking data drift reduces dataset size
  • Assuming data drift adds features
2. Which Python library is commonly used for detecting data drift in MLOps?
easy
A. Flask
B. NumPy
C. Matplotlib
D. Evidently

Solution

  1. Step 1: Recall common MLOps tools

    Evidently is a popular tool designed specifically for monitoring data and model drift.
  2. Step 2: Differentiate from other libraries

    NumPy is for math, Matplotlib for plotting, Flask for web apps, not for drift detection.
  3. Final Answer:

    Evidently -> Option D
  4. Quick Check:

    Evidently = data drift detection tool [OK]
Hint: Evidently is made for data drift detection [OK]
Common Mistakes:
  • Choosing NumPy or Matplotlib which are not for drift detection
  • Confusing Flask as a data tool
3. Given the code snippet using Evidently, what will report.run(reference_data, current_data) do?
medium
A. Visualize the model architecture
B. Train a new model on current_data
C. Compare current_data with reference_data to detect data drift
D. Delete old data from the system

Solution

  1. Step 1: Understand Evidently report usage

    The run method compares new data (current_data) against reference data to find differences.
  2. Step 2: Identify the purpose of the method

    It does not train models, visualize architecture, or delete data; it detects data drift.
  3. Final Answer:

    Compare current_data with reference_data to detect data drift -> Option C
  4. Quick Check:

    report.run compares data for drift [OK]
Hint: report.run compares new vs reference data [OK]
Common Mistakes:
  • Thinking it trains a model
  • Assuming it visualizes model structure
  • Believing it deletes data
4. You wrote this code to detect data drift but get an error:
from evidently.dashboard import Dashboard
dashboard = Dashboard(tabs=["data_drift"])
dashboard.run(current_data)
What is the likely mistake?
medium
A. Missing reference data argument in dashboard.run()
B. Incorrect import statement for Dashboard
C. Dashboard does not support data_drift tab
D. current_data is not a valid variable name

Solution

  1. Step 1: Check Dashboard.run() method requirements

    Dashboard.run() requires both reference and current datasets to compare for drift.
  2. Step 2: Identify missing argument

    Only current_data is passed; reference_data is missing, causing the error.
  3. Final Answer:

    Missing reference data argument in dashboard.run() -> Option A
  4. Quick Check:

    Dashboard.run needs reference and current data [OK]
Hint: Dashboard.run needs two datasets: reference and current [OK]
Common Mistakes:
  • Assuming import is wrong
  • Thinking data_drift tab is unsupported
  • Believing variable name causes error
5. You want to automate retraining your model when data drift is detected. Which approach best fits this goal?
hard
A. Set up a monitoring pipeline that runs data drift detection daily and triggers retraining if drift is found
B. Retrain the model every week regardless of data changes
C. Manually check data drift reports and retrain when you have time
D. Ignore data drift and only retrain when model accuracy drops

Solution

  1. Step 1: Understand automation in MLOps

    Automating retraining based on data drift ensures the model stays accurate without manual checks.
  2. Step 2: Identify best practice

    Running daily drift detection and triggering retraining only when drift occurs is efficient and effective.
  3. Final Answer:

    Set up a monitoring pipeline that runs data drift detection daily and triggers retraining if drift is found -> Option A
  4. Quick Check:

    Automate retrain on drift detection = best practice [OK]
Hint: Automate retrain triggered by drift detection [OK]
Common Mistakes:
  • Retraining blindly without checking data
  • Relying on manual checks only
  • Ignoring drift until accuracy drops