Data drift detection basics
📖 Scenario: You work as a machine learning engineer. Your model uses data from sensors to predict equipment failures. Over time, the data can change, which may reduce model accuracy. This change is called data drift. Detecting data drift early helps keep the model reliable.
🎯 Goal: Build a simple Python script that detects data drift by comparing the distribution of new sensor data with the original training data.
📋 What You'll Learn
Create a dictionary called training_data with sensor readings as keys and their counts as values
Create a dictionary called new_data with sensor readings as keys and their counts as values
Create a variable called drift_threshold set to 0.2 (20%)
Calculate the total counts in training_data and new_data
Use a for loop with variables reading and count to iterate over training_data.items()
Calculate the proportion difference for each reading between training_data and new_data
Detect if any proportion difference exceeds drift_threshold
Print "Data drift detected" if drift is found, otherwise print "No data drift detected"
💡 Why This Matters
🌍 Real World
Detecting data drift helps maintain machine learning model accuracy by alerting engineers when input data changes significantly.
💼 Career
Data scientists and MLOps engineers use data drift detection to monitor models in production and trigger retraining or alerts.
1
Create the training data dictionary
Create a dictionary called training_data with these exact entries: "temp_high": 50, "temp_normal": 150, "temp_low": 30
Hint
Use curly braces {} to create a dictionary with keys and values.
2
Create the new data dictionary and drift threshold
Create a dictionary called new_data with these exact entries: "temp_high": 100, "temp_normal": 90, "temp_low": 40. Then create a variable called drift_threshold and set it to 0.2
Hint
Remember to use the exact variable names and values given.
3
Calculate proportions and detect drift
Calculate the total counts in training_data and new_data by calling sum() on each dictionary's values. Set a variable drift_detected to False so that it exists even when no drift is found. Then use a for loop with variables reading and count to iterate over training_data.items(). Inside the loop, calculate the proportion of each reading in training_data and new_data. If the absolute difference between these proportions is greater than drift_threshold, set drift_detected to True.
Hint
Use new_data.get(reading, 0) to safely get counts from new_data.
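One way Step 3 could look is sketched below. The names total_training, total_new, and differences are assumptions made for this illustration, and the differences dictionary is an extra that the exercise does not ask for; it is kept only so each per-reading gap can be inspected.

```python
# Data and threshold from Steps 1 and 2
training_data = {"temp_high": 50, "temp_normal": 150, "temp_low": 30}
new_data = {"temp_high": 100, "temp_normal": 90, "temp_low": 40}
drift_threshold = 0.2

# Total number of readings in each dataset (assumed variable names)
total_training = sum(training_data.values())
total_new = sum(new_data.values())

drift_detected = False  # start from "no drift" so the flag always exists
differences = {}        # illustrative extra: per-reading proportion gap

for reading, count in training_data.items():
    training_proportion = count / total_training
    # .get(reading, 0) treats a reading missing from new_data as a count of 0
    new_proportion = new_data.get(reading, 0) / total_new
    difference = abs(training_proportion - new_proportion)
    differences[reading] = round(difference, 3)
    if difference > drift_threshold:
        drift_detected = True

print(differences)
print(drift_detected)
```

With the exercise data, temp_high and temp_normal both shift by more than 0.2, so the flag ends up True.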
4
Print the drift detection result
Write a print statement that prints "Data drift detected" if drift_detected is True. Otherwise, print "No data drift detected".
Hint
Use an if statement to check drift_detected and print the correct message.
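The four steps can also be packaged as a reusable function, sketched here under a few assumptions: the helper name detect_drift is hypothetical, both dictionaries are assumed to be non-empty, and the loop runs over the union of keys rather than only training_data.items(). That last choice is a small variation on the exercise, made so that a reading appearing only in the new data is not silently ignored.

```python
def detect_drift(training_counts, new_counts, threshold):
    """Return True if any reading's share shifts by more than threshold.

    Hypothetical helper; assumes both dictionaries hold at least one count.
    """
    total_training = sum(training_counts.values())
    total_new = sum(new_counts.values())
    # Union of keys: also covers readings that appear only in the new data
    for reading in set(training_counts) | set(new_counts):
        training_share = training_counts.get(reading, 0) / total_training
        new_share = new_counts.get(reading, 0) / total_new
        if abs(training_share - new_share) > threshold:
            return True
    return False


training_data = {"temp_high": 50, "temp_normal": 150, "temp_low": 30}
new_data = {"temp_high": 100, "temp_normal": 90, "temp_low": 40}
drift_threshold = 0.2

drift_detected = detect_drift(training_data, new_data, drift_threshold)

# Step 4: report the result
if drift_detected:
    print("Data drift detected")
else:
    print("No data drift detected")
```

For the exercise data this prints "Data drift detected"; comparing training_data with itself would print the other message.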
Practice
1. What is the main purpose of data drift detection in machine learning?
easy
A. To check if new data differs significantly from the training data
B. To improve the speed of model training
C. To reduce the size of the training dataset
D. To increase the number of features in the model
Solution
Step 1: Understand data drift concept
Data drift detection monitors whether incoming data has changed relative to the data used to train the model.
Step 2: Identify the purpose
Raising an alert when the change is significant helps keep the model accurate.
Final Answer:
To check if new data differs significantly from the training data -> Option A
Quick Check:
Data drift = detecting data changes [OK]
Hint: Data drift means new data differs from old data [OK]
Common Mistakes:
Confusing data drift with model training speed
Thinking data drift reduces dataset size
Believing data drift adds features
2. Which of the following is a correct Python code snippet to check data drift using the Kolmogorov-Smirnov test on two datasets data_train and data_new?
easy
A. from scipy.stats import ks_test
result = ks_test(data_train, data_new)
print(result.pvalue)
B. from scipy.stats import ks_2samp
result = ks_2samp(data_train, data_new)
print(result.pvalue)
C. from sklearn.drift import ks_test
result = ks_test(data_train, data_new)
print(result.pvalue)
D. import stats
result = stats.ks_test(data_train, data_new)
print(result.pvalue)
Solution
Step 1: Identify correct import and function
The two-sample Kolmogorov-Smirnov test is available in scipy.stats as ks_2samp.
Step 2: Check function usage
Calling ks_2samp(data_train, data_new) returns a result object with statistic and pvalue attributes.
Final Answer:
from scipy.stats import ks_2samp
result = ks_2samp(data_train, data_new)
print(result.pvalue) -> Option B
Quick Check:
Correct import and function = scipy.stats.ks_2samp [OK]
Hint: Use scipy.stats.ks_2samp for data drift test [OK]
Common Mistakes:
Using wrong module or function name
Trying to import non-existent ks_test
Confusing sklearn with scipy for this test
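To see how the p-value from ks_2samp might be turned into a drift decision, here is a minimal sketch. The sample values and the 0.05 significance level (alpha) are assumptions chosen for illustration, not values from the question.

```python
from scipy.stats import ks_2samp

# Hypothetical sensor samples: the second is the first shifted far upward
data_train = [float(x) for x in range(20)]
data_shifted = [x + 100.0 for x in data_train]
alpha = 0.05  # assumed significance level

# Comparing a sample with itself: the distributions are identical
same = ks_2samp(data_train, data_train)
# Comparing with the shifted sample: the distributions do not overlap
shifted = ks_2samp(data_train, data_shifted)

print("same data, drift flagged:", same.pvalue < alpha)
print("shifted data, drift flagged:", shifted.pvalue < alpha)
```

Only the shifted sample falls below the threshold, so only that comparison would be flagged as drift.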
3. Given the following Python code to detect data drift, what will be the output if data_train = [1, 2, 3, 4, 5] and data_new = [1, 2, 3, 4, 10]?
from scipy.stats import ks_2samp
result = ks_2samp(data_train, data_new)
print(round(result.pvalue, 2))
medium
A. 0.87
B. 0.05
C. 0.01
D. 1.00
Solution
Step 1: Understand the test and data
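The two-sample Kolmogorov-Smirnov test compares the empirical distributions of the two samples. Here only one value differs (5 vs 10), so the largest gap between the two empirical CDFs is D = 0.2.
Step 2: Interpret p-value meaning
A high p-value (close to 1) means no significant difference; a low p-value signals drift. With five values per sample, D = 0.2 is the smallest gap two tie-free samples can show, so the exact p-value is 1.0.
Final Answer:
1.00 -> Option D
Quick Check:
Small difference in small samples gives p-value = 1.00 [OK]
Hint: Small data changes give high p-value (no drift) [OK]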
Common Mistakes:
Assuming any difference means low p-value
Confusing p-value with test statistic
Rounding errors in output
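The p-value for these two samples can be checked by hand without SciPy. The sketch below is a hand-rolled illustration, not SciPy's implementation: ks_statistic and exact_p_value are hypothetical helpers, and the p-value calculation assumes two equal-sized samples with no tied values, counting every equally likely ordering of the pooled data.

```python
from itertools import combinations


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    n_a, n_b = len(sample_a), len(sample_b)
    largest_gap = 0.0
    for x in sorted(set(sample_a) | set(sample_b)):
        cdf_a = sum(value <= x for value in sample_a) / n_a
        cdf_b = sum(value <= x for value in sample_b) / n_b
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def exact_p_value(statistic, n):
    """Share of orderings whose statistic is at least as large as observed.

    Assumes two samples of equal size n and no tied values.
    """
    observed_steps = round(statistic * n)  # the statistic is a multiple of 1/n
    positions = range(2 * n)
    at_least_as_large = 0
    total = 0
    for chosen in combinations(positions, n):
        chosen = set(chosen)
        lead = 0      # (values seen from sample A) - (values seen from sample B)
        max_lead = 0
        for position in positions:
            lead += 1 if position in chosen else -1
            max_lead = max(max_lead, abs(lead))
        total += 1
        if max_lead >= observed_steps:
            at_least_as_large += 1
    return at_least_as_large / total


data_train = [1, 2, 3, 4, 5]
data_new = [1, 2, 3, 4, 10]

d = ks_statistic(data_train, data_new)
p = exact_p_value(d, len(data_train))
print(round(d, 2), round(p, 2))  # prints: 0.2 1.0
```

Every one of the 252 possible orderings produces a gap of at least 0.2, which is why such a small difference carries no evidence of drift.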
4. You wrote this code to detect data drift but get an error: AttributeError: module 'scipy.stats' has no attribute 'ks_test'. What is the fix?
import scipy.stats as stats
result = stats.ks_test(data_train, data_new)
print(result.pvalue)
medium
A. Use stats.kstest instead of ks_test
B. Import ks_test from scipy.stats explicitly
C. Change ks_test to ks_2samp in the code
D. Update scipy package to latest version
Solution
Step 1: Identify the error cause
The error says ks_test does not exist in scipy.stats.
Step 2: Use correct function name
SciPy's dedicated function for the two-sample KS test is ks_2samp; no function named ks_test exists in scipy.stats.
Final Answer:
Change ks_test to ks_2samp in the code -> Option C
Quick Check:
Function name must be ks_2samp [OK]
Hint: Use ks_2samp, not ks_test, for two-sample KS test [OK]
Common Mistakes:
Trying to import non-existent ks_test
Using one-sample test function by mistake
Ignoring error message details
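When an AttributeError like this appears, listing the names a module actually exposes is a quick way to find the right one. The sketch below assumes SciPy is installed; the two sample lists are made up for illustration.

```python
import scipy.stats as stats

# List the public names in scipy.stats that start with "ks"
ks_functions = [name for name in dir(stats) if name.startswith("ks")]
print(ks_functions)

# Confirm which of the two candidate names exists
print(hasattr(stats, "ks_test"))   # the name from the failing code
print(hasattr(stats, "ks_2samp"))  # the two-sample KS test

# Corrected call, using hypothetical sample data
data_train = [0.9, 1.1, 1.0, 1.2, 0.8, 1.05]
data_new = [1.0, 1.3, 0.95, 1.15, 0.85, 1.1]
result = stats.ks_2samp(data_train, data_new)
print(result.pvalue)
```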
5. You want to monitor data drift for multiple features in your dataset. Which approach best helps detect drift over time and alert you when it happens?
hard
A. Ignore data drift and focus on model accuracy metrics only
B. Retrain the model daily without checking data changes
C. Increase the model complexity to handle any data changes automatically
D. Run a statistical test like KS test on each feature periodically and trigger alerts if p-value is below threshold
Solution
Step 1: Understand monitoring multiple features
Checking each feature for drift helps catch changes in data distribution over time.
Step 2: Use statistical tests and alerts
Applying tests like KS test periodically and alerting on low p-values ensures timely detection.
Final Answer:
Run a statistical test like KS test on each feature periodically and trigger alerts if p-value is below threshold -> Option D
Quick Check:
Periodic tests + alerts = best drift monitoring [OK]
Hint: Test features regularly and alert on low p-values [OK]
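A per-feature check along these lines could be run on a schedule. This is a minimal sketch under stated assumptions: find_drifted_features is a hypothetical helper, the feature names and values are invented, alpha = 0.05 is an assumed threshold, and the print call stands in for whatever alerting channel is actually used.

```python
from scipy.stats import ks_2samp


def find_drifted_features(reference, current, alpha=0.05):
    """Return the features whose KS-test p-value falls below alpha.

    Hypothetical helper: both arguments map feature name -> list of values.
    """
    drifted = []
    for feature, reference_values in reference.items():
        result = ks_2samp(reference_values, current[feature])
        if result.pvalue < alpha:
            drifted.append(feature)
    return drifted


# Invented reference (training-time) and current (production) samples
reference = {
    "temperature": [20.0 + 0.1 * i for i in range(50)],
    "vibration": [1.0 + 0.01 * i for i in range(50)],
}
current = {
    "temperature": [20.05 + 0.1 * i for i in range(50)],  # nearly unchanged
    "vibration": [3.0 + 0.01 * i for i in range(50)],     # shifted upward
}

alerts = find_drifted_features(reference, current)
for feature in alerts:
    print(f"ALERT: drift detected in {feature}")
```

Only vibration is flagged here. When many features are tested at once, a few low p-values can occur by chance, so the threshold may need to be stricter than for a single test.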