MLOps | DevOps | ~30 mins

Data validation in CI pipeline in MLOps - Mini Project: Build & Apply

Data validation in CI pipeline

📖 Scenario: You are working on a machine learning project where new data files are added regularly. To keep the model accurate, you want to check the data quality automatically before training. This means your Continuous Integration (CI) pipeline should validate the data files to catch errors early.

🎯 Goal: Build a simple Python script that validates a dataset in the CI pipeline by checking that every required column exists and that no required column contains a missing value (None). This helps ensure that only clean data moves forward in the pipeline.

📋 What You'll Learn

Create a dictionary representing a dataset with specific columns and sample values

Add a list of required columns to check against the dataset

Write a loop to verify that all required columns exist and contain no missing values

Print a message indicating if the data passed validation or not

💡 Why This Matters

🌍 Real World

In machine learning projects, data quality is crucial. Automating data validation in the CI pipeline helps catch errors before training models, saving time and improving reliability.

💼 Career

Data validation skills are important for MLOps engineers and data scientists to ensure clean data flows through automated pipelines, reducing bugs and improving model performance.

Create the dataset dictionary

Create a dictionary called dataset with these exact entries: 'id': [1, 2, 3], 'age': [25, 30, 22], 'income': [50000, 60000, 55000], 'name': ['Alice', 'Bob', 'Charlie']

# Create the dataset dictionary with exact keys and values
# Your code here

Hint

Use a dictionary with keys as column names and lists as values for each column.
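
In a real pipeline the column dictionary would usually be built from a data file rather than typed by hand. The following is a minimal sketch of that idea using only the standard library; the helper name load_columns and the sample CSV text are hypothetical and not part of the exercise. Note that values read from CSV stay as strings, and empty cells are mapped to None so that the later None check can find them.

```python
import csv
import io

# Hypothetical file contents; a real pipeline would pass open(path, newline="") instead.
csv_text = "id,age,income,name\n1,25,50000,Alice\n2,,60000,Bob\n"


def load_columns(file_obj):
    """Read CSV rows into {column name: list of values}; empty cells become None."""
    reader = csv.DictReader(file_obj)
    columns = {name: [] for name in reader.fieldnames or []}
    for row in reader:
        for name in columns:
            value = row[name]
            # Treat an empty cell as a missing value.
            columns[name].append(value if value != "" else None)
    return columns


dataset = load_columns(io.StringIO(csv_text))
print(dataset)
```

The result has the same shape as the dataset dictionary in this step, so the validation steps that follow could run on it unchanged.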

Define required columns list

Create a list called required_columns with these exact values: 'id', 'age', 'income', 'name'

dataset = {
    'id': [1, 2, 3],
    'age': [25, 30, 22],
    'income': [50000, 60000, 55000],
    'name': ['Alice', 'Bob', 'Charlie']
}
# Create the required_columns list with exact values
# Your code here

Hint

Use a list with the exact column names as strings.

Validate dataset columns and missing values

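Set a variable valid to True before the loop. Then write a for loop using col to iterate over required_columns. Inside the loop, check whether col is not in dataset or whether any value in dataset[col] is None. If either is true, set valid to False and break out of the loop. Test for the missing column first: because or short-circuits, dataset[col] is never evaluated for a column that does not exist, which avoids a KeyError.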

dataset = {
    'id': [1, 2, 3],
    'age': [25, 30, 22],
    'income': [50000, 60000, 55000],
    'name': ['Alice', 'Bob', 'Charlie']
}
required_columns = ['id', 'age', 'income', 'name']

# Initialize valid as True
# Use a for loop with col to check each required column
# Check if col not in dataset or if None in dataset[col]
# Set valid to False and break if check fails
# Your code here

Hint

Use None in dataset[col] to check for missing values in the column list.
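
The loop in this step stops at the first failure, which tells you that something is wrong but not what. As an optional variation, here is a minimal sketch that collects every problem instead; the function name find_problems and the deliberately broken sample data are hypothetical and not part of the exercise.

```python
def find_problems(dataset, required_columns):
    """Return a list of problem descriptions; an empty list means the data is valid."""
    problems = []
    for col in required_columns:
        if col not in dataset:
            problems.append(f"missing column: {col}")
        elif None in dataset[col]:
            problems.append(f"missing value in column: {col}")
    return problems


# Sample data with two deliberate defects: a None in 'age' and no 'income' column.
dataset = {
    'id': [1, 2, 3],
    'age': [25, None, 22],
    'name': ['Alice', 'Bob', 'Charlie'],
}
required_columns = ['id', 'age', 'income', 'name']

problems = find_problems(dataset, required_columns)
print(problems)
# ['missing value in column: age', 'missing column: income']
```

Reporting all problems at once can save a round trip through CI when a data file has more than one defect.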

Print validation result

Write a print statement that outputs exactly "Data validation passed" if valid is True, otherwise print "Data validation failed".

dataset = {
    'id': [1, 2, 3],
    'age': [25, 30, 22],
    'income': [50000, 60000, 55000],
    'name': ['Alice', 'Bob', 'Charlie']
}
required_columns = ['id', 'age', 'income', 'name']

valid = True
for col in required_columns:
    if col not in dataset or None in dataset[col]:
        valid = False
        break

# Print "Data validation passed" if valid is True, else print "Data validation failed"
# Your code here

Hint

Use an if statement to check valid and print the exact messages.
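
A printed message alone does not stop a build. CI systems generally treat a step as failed when its process exits with a non-zero status, so a validation script meant for CI would typically turn the result into an exit code. The following is a minimal sketch under that assumption; the function names validate and main are hypothetical, while the messages and data match the exercise.

```python
import sys


def validate(dataset, required_columns):
    """Return True if every required column exists and contains no None."""
    for col in required_columns:
        if col not in dataset or None in dataset[col]:
            return False
    return True


def main():
    dataset = {
        'id': [1, 2, 3],
        'age': [25, 30, 22],
        'income': [50000, 60000, 55000],
        'name': ['Alice', 'Bob', 'Charlie'],
    }
    required_columns = ['id', 'age', 'income', 'name']

    if validate(dataset, required_columns):
        print("Data validation passed")
        return 0
    print("Data validation failed")
    return 1


if __name__ == "__main__":
    code = main()
    if code != 0:
        # A non-zero exit status is what makes the CI step fail.
        sys.exit(code)
```

With the sample data the script prints "Data validation passed" and ends normally, which is an exit status of 0. Removing a column or inserting a None would make it exit with status 1.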

Practice

1. What is the main purpose of adding data validation in a CI pipeline for machine learning projects?

easy

A. To speed up the model training process

B. To catch data problems early before training models

C. To reduce the size of the dataset

D. To automatically deploy models to production
