Data validation in CI pipeline
📖 Scenario: You are working on a machine learning project where new data files are added regularly. To keep the model accurate, you want to check the data quality automatically before training. This means your Continuous Integration (CI) pipeline should validate the data files to catch errors early.
🎯 Goal: Build a simple Python script that validates a dataset in the CI pipeline by checking if all required columns exist and if numeric columns have no missing values. This will help ensure only clean data moves forward in the pipeline.
📋 What You'll Learn
Create a dictionary representing a dataset with specific columns and sample values
Add a list of required columns to check against the dataset
Write a loop to verify all required columns exist and numeric columns have no missing values
Print a message indicating if the data passed validation or not
💡 Why This Matters
🌍 Real World
In machine learning projects, data quality is crucial. Automating data validation in the CI pipeline helps catch errors before training models, saving time and improving reliability.
💼 Career
Data validation skills are important for MLOps engineers and data scientists to ensure clean data flows through automated pipelines, reducing bugs and improving model performance.
1
Create the dataset dictionary
Create a dictionary called dataset with these exact entries: 'id': [1, 2, 3], 'age': [25, 30, 22], 'income': [50000, 60000, 55000], 'name': ['Alice', 'Bob', 'Charlie']
Hint
Use a dictionary with keys as column names and lists as values for each column.
2
Define required columns list
Create a list called required_columns with these exact values: 'id', 'age', 'income', 'name'
Hint
Use a list with the exact column names as strings.
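Steps 1 and 2 together might look like the sketch below. The dictionary and list values come straight from the step descriptions; the row-count check at the end is an extra assumption for illustration and is not part of the exercise.

```python
# Column-oriented dataset: each key is a column name,
# each list holds that column's values.
dataset = {
    'id': [1, 2, 3],
    'age': [25, 30, 22],
    'income': [50000, 60000, 55000],
    'name': ['Alice', 'Bob', 'Charlie'],
}

# Columns the validation step will require.
required_columns = ['id', 'age', 'income', 'name']

# Extra sanity check (an assumption, not required by the exercise):
# every column should hold the same number of rows.
row_counts = {len(values) for values in dataset.values()}
print(sorted(dataset), row_counts)
```

Keeping the required columns in a separate list means the check can change without touching the data itself.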
3
Validate dataset columns and missing values
Set a variable valid to True before the loop. Then write a for loop using col to iterate over required_columns. Inside the loop, check if col is not in the dataset keys or if any value in dataset[col] is None. If either is true, set valid to False and break out of the loop.
Hint
Use None in dataset[col] to check for missing values in the column list.
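One way the loop from this step could be written is sketched below. The data is repeated so the snippet runs on its own; treat it as an illustration rather than the only accepted answer.

```python
dataset = {
    'id': [1, 2, 3],
    'age': [25, 30, 22],
    'income': [50000, 60000, 55000],
    'name': ['Alice', 'Bob', 'Charlie'],
}
required_columns = ['id', 'age', 'income', 'name']

# Assume the data is valid until a check proves otherwise.
valid = True
for col in required_columns:
    # Fail if the column is absent or contains a missing value (None).
    # The membership test runs first, so dataset[col] is only read
    # when the column exists.
    if col not in dataset or None in dataset[col]:
        valid = False
        break

print(valid)
```

The order of the two conditions matters: because `or` short-circuits, a missing column never reaches `dataset[col]` and so cannot raise a KeyError.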
4
Print validation result
Write a print statement that outputs exactly "Data validation passed" if valid is True, otherwise print "Data validation failed".
Hint
Use an if statement to check valid and print the exact messages.
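The four steps can be combined into a small script, sketched here under some assumptions: the helper names `run_validation` and `report` are hypothetical, and returning an exit code goes slightly beyond the exercise to show how the result could be handed to a CI job.

```python
def run_validation(dataset, required_columns):
    """Return True when every required column exists and has no None values."""
    valid = True
    for col in required_columns:
        if col not in dataset or None in dataset[col]:
            valid = False
            break
    return valid


def report(valid):
    """Print the exact message from step 4 and return a matching exit code."""
    if valid:
        print("Data validation passed")
        return 0
    print("Data validation failed")
    return 1


dataset = {
    'id': [1, 2, 3],
    'age': [25, 30, 22],
    'income': [50000, 60000, 55000],
    'name': ['Alice', 'Bob', 'Charlie'],
}
required_columns = ['id', 'age', 'income', 'name']

exit_code = report(run_validation(dataset, required_columns))
# In a real CI job, passing exit_code to sys.exit() would make the
# step fail whenever validation fails.
```

Printing alone does not fail a pipeline step; only a non-zero exit code does, which is why the sketch returns one alongside the message.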
Practice
1. What is the main purpose of adding data validation in a CI pipeline for machine learning projects?
easy
A. To speed up the model training process
B. To catch data problems early before training models
C. To reduce the size of the dataset
D. To automatically deploy models to production
Solution
Step 1: Understand the role of CI pipelines
CI pipelines automate checks and tests to ensure quality before further steps.
Step 2: Identify the purpose of data validation
Data validation ensures data quality and format correctness to avoid errors in training.
Final Answer:
To catch data problems early before training models -> Option B
Quick Check:
Data validation = catch problems early [OK]
Hint: Data validation stops bad data early in pipeline [OK]
Common Mistakes:
Thinking validation speeds training
Confusing validation with deployment
Assuming validation reduces data size
2. Which of the following is the correct way to fail a CI pipeline step if a data validation script returns a non-zero exit code in a bash script?
easy
A. python validate_data.py || exit 1
B. python validate_data.py && exit 1
C. python validate_data.py; exit 0
D. python validate_data.py | exit 1
Solution
Step 1: Understand bash exit codes and operators
The '||' operator runs the command after it if the first command fails (non-zero exit).
Step 2: Apply to data validation script
If 'validate_data.py' fails, 'exit 1' stops the pipeline with error.
Final Answer:
python validate_data.py || exit 1 -> Option A
Quick Check:
Fail on error = '|| exit 1' [OK]
Hint: Use '|| exit 1' to fail on script error [OK]
Common Mistakes:
Using '&&' instead of '||' to fail
Using pipe '|' incorrectly
Exiting with 0 always
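The behavior of these operators can be checked from Python with the standard `subprocess` module. This is a rough sketch that assumes a POSIX `sh` is available; `false` and `true` stand in for a failing and a passing validate_data.py, and `step_exit_code` is a hypothetical helper.

```python
import subprocess


def step_exit_code(script):
    """Run a shell snippet with sh and return its exit code."""
    return subprocess.run(["sh", "-c", script]).returncode


# `false` always exits non-zero, `true` always exits 0.
for label, cmd in [("failing script", "false"), ("passing script", "true")]:
    print(label, "with '|| exit 1':", step_exit_code(f"{cmd} || exit 1"))
    print(label, "with '&& exit 1':", step_exit_code(f"{cmd} && exit 1"))
    print(label, "with '; exit 0': ", step_exit_code(f"{cmd}; exit 0"))
```

Only the `||` form tracks the script's result in both directions. The `&&` form also fails when validation passes, and `; exit 0` always reports success.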
3. Given this Python snippet in a CI pipeline step:
import sys

def validate(data):
    if not data or len(data) < 5:
        return False
    return True

if __name__ == '__main__':
    data = sys.argv[1] if len(sys.argv) > 1 else ''
    if validate(data):
        print('Validation passed')
        sys.exit(0)
    else:
        print('Validation failed')
        sys.exit(1)
What will be the output and exit code if the pipeline runs python validate.py "abc"?
medium
A. Validation failed and exit code 0
B. Validation passed and exit code 0
C. Validation passed and exit code 1
D. Validation failed and exit code 1
Solution
Step 1: Check input data length
The input is 'abc', whose length is 3. That is less than 5, so validate returns False.
Step 2: Determine output and exit code
Since validate returns False, it prints 'Validation failed' and exits with code 1.
Final Answer:
Validation failed and exit code 1 -> Option D
Quick Check:
Short data fails validation = D [OK]
Hint: Check input length to predict validation result [OK]
Common Mistakes:
Assuming any input passes
Confusing exit codes 0 and 1
Ignoring input length check
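To trace the answer without running the script from a shell, the same rule can be exercised directly. In this sketch `exit_code_for` is a hypothetical helper that mirrors the script's sys.exit calls.

```python
def validate(data):
    # Same rule as the quiz snippet: reject empty input
    # or input shorter than 5 characters.
    if not data or len(data) < 5:
        return False
    return True


def exit_code_for(data):
    """Return the exit code the script would use for this input."""
    return 0 if validate(data) else 1


for sample in ["", "abc", "abcde"]:
    status = "Validation passed" if validate(sample) else "Validation failed"
    print(repr(sample), status, exit_code_for(sample))
```

The empty string covers the case where no argument is passed at all, which the script treats as a failure.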
4. You have this YAML snippet in a CI pipeline to run data validation:
The pipeline does not fail even when validate.py returns exit code 1. What is the likely problem?
medium
A. The shell does not stop on errors by default; need 'set -e'
B. The 'if' condition is incorrect and never triggers
C. The 'exit 1' is inside the if but the script continues after
D. The validate.py script always returns 0
Solution
Step 1: Understand shell error handling
By default, shell scripts continue even if a command fails unless 'set -e' is used.
Step 2: Apply to CI step
Without 'set -e', the script continues after python fails, runs the echo which succeeds, so step exit code is 0.
Final Answer:
The shell does not stop on errors by default; need 'set -e' -> Option A
Quick Check:
Use 'set -e' to fail pipeline on errors [OK]
Hint: Add 'set -e' to stop on errors in shell scripts [OK]
Common Mistakes:
Assuming exit 1 always stops pipeline
Misreading if condition syntax
Ignoring shell default behavior
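The effect of 'set -e' can be observed with a small experiment. The sketch below is not the pipeline's YAML: it assumes a POSIX `sh` is available, uses `false` as a stand-in for a failing validation script, and `run_step` is a hypothetical helper.

```python
import subprocess


def run_step(script):
    """Run a multi-line shell script and return (exit code, stdout)."""
    result = subprocess.run(["sh", "-c", script], capture_output=True, text=True)
    return result.returncode, result.stdout.strip()


# Without 'set -e' the shell keeps going after the failure,
# so the final echo decides the exit code.
without_e = run_step("false\necho 'step finished'")

# With 'set -e' the shell stops at the first failing command.
with_e = run_step("set -e\nfalse\necho 'step finished'")

print("without set -e:", without_e)
print("with set -e:   ", with_e)
```

A step's exit code is the exit code of its last command, which is why a trailing echo can hide an earlier failure.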
5. You want to add a data validation step in your CI pipeline that checks if a CSV file has no missing values and all numeric columns are within a specific range. Which approach best fits this requirement?
hard
A. Use a shell script with grep to search for empty fields and numeric ranges
B. Manually inspect the CSV file before running the pipeline
C. Write a Python script using pandas to check missing values and ranges, then fail with exit code 1 if invalid
D. Skip validation and rely on model training to catch errors
Solution
Step 1: Identify tools for data validation
Pandas in Python is ideal for checking missing values and numeric ranges programmatically.
Step 2: Implement validation and fail pipeline
Script should exit with code 1 if validation fails to stop the pipeline safely.
Final Answer:
Write a Python script using pandas to check missing values and ranges, then fail with exit code 1 if invalid -> Option C
Quick Check:
Use pandas script + exit 1 for robust validation [OK]
Hint: Use pandas for detailed CSV validation and fail on error [OK]
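A minimal sketch of such a pandas check is shown below. The column names, the allowed ranges in `RANGES`, and the helper `find_problems` are assumptions made for illustration, and the CSV is read from an in-memory string so the example runs without a file.

```python
import io

import pandas as pd

# Hypothetical allowed ranges per numeric column (not from the quiz).
RANGES = {"age": (0, 120), "income": (0, 1_000_000)}


def find_problems(df, ranges):
    """Return a list of problem descriptions; an empty list means valid."""
    problems = []
    # Count missing values per column and report the columns that have any.
    missing = df.isna().sum()
    for col, count in missing[missing > 0].items():
        problems.append(f"{col}: {count} missing value(s)")
    # Check that each listed column stays inside its allowed range.
    for col, (low, high) in ranges.items():
        if col not in df.columns:
            problems.append(f"{col}: column not found")
            continue
        out_of_range = ~df[col].dropna().between(low, high)
        if out_of_range.any():
            problems.append(
                f"{col}: {int(out_of_range.sum())} value(s) outside [{low}, {high}]"
            )
    return problems


# Sample data with one missing income and one out-of-range age.
sample_csv = io.StringIO("age,income\n25,\n300,60000\n")
problems = find_problems(pd.read_csv(sample_csv), RANGES)
for problem in problems:
    print(problem)

# In a CI script, this value would be passed to sys.exit().
exit_code = 1 if problems else 0
```

Collecting every problem before exiting, rather than stopping at the first one, gives a more useful CI log when several columns are bad at once.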