MLOps · DevOps · ~10 mins

Data validation in a CI pipeline in MLOps - Commands & Configuration

Introduction
Data validation in a CI pipeline catches errors in data before it reaches your machine learning models. It automatically checks that the data meets quality standards every time new data or code is added.
When you want to check if new data files have missing or unexpected values before training a model
When you want to automatically stop a pipeline if data quality is poor
When you want to track data quality metrics over time to detect data drift
When you want to enforce data schema rules in your automated tests
When you want to integrate data checks as part of your code review process
Commands
Install Great Expectations, a popular Python library for data validation used in CI pipelines.
Terminal
pip install great_expectations
Expected Output
Collecting great_expectations
  Downloading great_expectations-0.16.18-py3-none-any.whl (1.2 MB)
Installing collected packages: great_expectations
Successfully installed great_expectations-0.16.18
Initialize a Great Expectations project in the current directory to set up configuration and folders.
Terminal
great_expectations init
Expected Output
Great Expectations has been successfully initialized! Your new Great Expectations project is ready to use.
Run a Python script that loads data and validates it using Great Expectations in the CI pipeline.
Terminal
python validate_data.py
Expected Output
Validation succeeded: All data checks passed.
Key Concept

If you remember nothing else, remember: automate data quality checks in your CI pipeline to catch data issues early and prevent bad data from breaking your ML models.

Code Example
MLOps
import sys

import great_expectations as ge

def validate_data():
    # Load data as a Great Expectations dataset
    df = ge.read_csv('data/sample_data.csv')

    # Expect column 'age' to have no nulls
    result1 = df.expect_column_values_to_not_be_null('age')

    # Expect column 'salary' to be greater than 0
    result2 = df.expect_column_values_to_be_between('salary', min_value=1)

    # Check if all expectations passed
    if result1.success and result2.success:
        print('Validation succeeded: All data checks passed.')
    else:
        print('Validation failed: Data checks did not pass.')
        sys.exit(1)  # non-zero exit code makes the CI job fail

if __name__ == '__main__':
    validate_data()
Output: Success
Common Mistakes
Not installing the data validation library before running validation scripts
The validation commands will fail because the required tools are missing.
Always install dependencies like Great Expectations before running validation commands.
Running validation without initializing the project configuration
Validation will fail because the project lacks necessary config files and folders.
Run 'great_expectations init' once to set up the project before validating data.
Ignoring validation failures in the CI pipeline
Bad data can silently pass through and cause model errors later.
Fail the CI build if data validation reports errors to enforce data quality.
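To see how the last point works in practice, a CI step can run the validation script and inspect its exit code. This is a minimal sketch; the inline `-c` command is a stand-in for running a real `validate_data.py`:

```python
import subprocess
import sys

# Stand-in for "python validate_data.py"; here we simulate a failing check.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print('Validation failed'); sys.exit(1)"],
    capture_output=True,
    text=True,
)

print(result.stdout.strip())
if result.returncode != 0:
    print("Data validation failed: stopping the pipeline.")
    # In a real CI job, propagate the failure, e.g. sys.exit(result.returncode)
```

Most CI systems (GitHub Actions, GitLab CI, Jenkins) already fail a job when any command in it exits non-zero, so simply not swallowing the exit code is usually enough.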
Summary
Install Great Expectations to add data validation capabilities.
Initialize the Great Expectations project to create config files.
Run validation scripts in the CI pipeline to automatically check data quality.
Fail the pipeline if data validation fails to prevent bad data usage.