MLOps | devops | ~30 mins

Data validation in CI pipeline in MLOps - Mini Project: Build & Apply

📖 Scenario: You are working on a machine learning project where new data files are added regularly. To keep the model accurate, you want to check the data quality automatically before training. This means your Continuous Integration (CI) pipeline should validate the data files to catch errors early.
🎯 Goal: Build a simple Python script that validates a dataset in the CI pipeline by checking that every required column exists and that no required column contains a missing value (None). This helps ensure that only clean data moves forward in the pipeline.
📋 What You'll Learn
Create a dictionary representing a dataset with specific columns and sample values
Add a list of required columns to check against the dataset
Write a loop to verify that all required columns exist and that none of them contains a missing value (None)
Print a message indicating if the data passed validation or not
💡 Why This Matters
🌍 Real World
In machine learning projects, data quality is crucial. Automating data validation in the CI pipeline helps catch errors before training models, saving time and improving reliability.
💼 Career
Data validation skills are important for MLOps engineers and data scientists to ensure clean data flows through automated pipelines, reducing bugs and improving model performance.
1
Create the dataset dictionary
Create a dictionary called dataset with these exact entries: 'id': [1, 2, 3], 'age': [25, 30, 22], 'income': [50000, 60000, 55000], 'name': ['Alice', 'Bob', 'Charlie']
Need a hint?

Use a dictionary with keys as column names and lists as values for each column.
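
One way to write step 1 is sketched below. It uses only the entries listed in the step; the layout with one column per line is a stylistic choice, not a requirement.

```python
# Each key is a column name; each value is the list of row values for that column.
dataset = {
    'id': [1, 2, 3],
    'age': [25, 30, 22],
    'income': [50000, 60000, 55000],
    'name': ['Alice', 'Bob', 'Charlie'],
}
```

Every list has three values, so row i of the dataset is made up of the i-th item of each column.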

2
Define required columns list
Create a list called required_columns with these exact values: 'id', 'age', 'income', 'name'
Need a hint?

Use a list with the exact column names as strings.
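
A minimal sketch of step 2, using exactly the column names given in the step:

```python
# Column names the validation step will expect to find in the dataset.
required_columns = ['id', 'age', 'income', 'name']
```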

3
Validate dataset columns and missing values
Set a variable valid to True before the loop. Then write a for loop using col to iterate over required_columns. Inside the loop, check if col is not in dataset or if any value in dataset[col] is None. If either is true, set valid to False and break the loop.
Need a hint?

Use None in dataset[col] to check for missing values in the column list.
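
The check described in step 3 might look like the sketch below. The dataset and required_columns definitions from steps 1 and 2 are repeated so the snippet runs on its own.

```python
dataset = {
    'id': [1, 2, 3],
    'age': [25, 30, 22],
    'income': [50000, 60000, 55000],
    'name': ['Alice', 'Bob', 'Charlie'],
}
required_columns = ['id', 'age', 'income', 'name']

# Assume the data is valid until a check fails.
valid = True
for col in required_columns:
    # The membership test comes first: "or" short-circuits, so dataset[col]
    # is never evaluated for a column that is missing.
    if col not in dataset or None in dataset[col]:
        valid = False
        break
```

With the sample data, every column is present and no list contains None, so valid stays True. Removing a key from dataset, or replacing any value with None, would make the loop set valid to False and stop at the first failing column.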

4
Print validation result
Write a print statement that outputs exactly "Data validation passed" if valid is True, otherwise print "Data validation failed".
Need a hint?

Use an if statement to check valid and print the exact messages.
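
Putting the four steps together, the complete script could look like the sketch below. The sys.exit(1) call is an assumption added for illustration and is not part of the exercise: CI systems generally treat a non-zero exit code as a failed step, so exiting this way would stop the pipeline when validation fails.

```python
import sys

dataset = {
    'id': [1, 2, 3],
    'age': [25, 30, 22],
    'income': [50000, 60000, 55000],
    'name': ['Alice', 'Bob', 'Charlie'],
}
required_columns = ['id', 'age', 'income', 'name']

valid = True
for col in required_columns:
    if col not in dataset or None in dataset[col]:
        valid = False
        break

if valid:
    print("Data validation passed")
else:
    print("Data validation failed")
    # Assumption, not required by the exercise: signal failure to the CI runner.
    sys.exit(1)
```

With the sample data from step 1, the script takes the first branch and prints the "passed" message.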