
Data validation in CI pipeline in MLOps - Step-by-Step Execution

Process Flow - Data validation in CI pipeline
Code Commit / Data Update
  -> Trigger CI Pipeline
  -> Run Data Validation Tests
  -> Pass: Continue Pipeline -> Deploy Model / Data
  -> Fail: Stop Pipeline and Alert
This flow shows how a data update triggers a CI pipeline that runs data validation tests. If tests pass, the pipeline continues; if they fail, it stops and alerts.
Execution Sample
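def validate_data(data):
    # Reject the data if any cell is missing.
    if data.isnull().sum().sum() > 0:
        return False
    # Reject the data if any age is negative.
    if (data['age'] < 0).any():
        return False
    return True

# new_data is assumed to be a pandas DataFrame loaded earlier in the pipeline.
result = validate_data(new_data)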
This code checks if the data has missing values or negative ages and returns True if data is valid, False otherwise.
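A boolean result tells the pipeline to stop but not why. One possible variation, sketched below, collects a message for each failed check so the alert can say what went wrong. The function name `validation_errors` and the sample DataFrame are assumptions made for illustration; they are not part of the original example.

```python
import pandas as pd


def validation_errors(data: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid.

    Hypothetical helper that applies the same two checks as validate_data.
    """
    errors = []
    missing = int(data.isnull().sum().sum())
    if missing > 0:
        errors.append(f"{missing} missing value(s) found")
    negative = int((data["age"] < 0).sum())
    if negative > 0:
        errors.append(f"{negative} negative age value(s) found")
    return errors


# Hypothetical sample data with one negative age and one missing value.
sample = pd.DataFrame({"age": [34, -2, None]})
problems = validation_errors(sample)
print(problems)
```

Unlike `validate_data`, this sketch does not return at the first failure, so a single run can report every problem it finds.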
Process Table
Step | Check | Condition | Result | Action
1 | Check missing values | data.isnull().sum().sum() > 0 | False | Continue
2 | Check negative ages | (data['age'] < 0).any() | True | Fail validation
3 | Return validation result | N/A | False | Stop pipeline and alert
💡 Validation failed because of negative age values, so the pipeline stops to prevent bad data from being deployed.
Status Tracker
Variable | Start | After Step 1 | After Step 2 | Final
data.isnull().sum().sum() | N/A | 0 | 0 | 0
(data['age'] < 0).any() | N/A | N/A | True | True
result | N/A | N/A | N/A | False
Key Moments - 2 Insights
Why does the pipeline stop even though there are no missing values?
Because the negative age check failed at step 2 (see Process Table, row 2). That check is critical for data quality, so the pipeline stops to prevent bad data from being deployed.
What happens if all checks pass?
If all checks pass (both conditions False), the validation returns True and the pipeline continues to deploy the model or data (not shown in this trace).
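Both outcomes can be traced with a short worked example. The sketch below repeats `validate_data` from the Execution Sample and runs it on three small DataFrames; the DataFrames are hypothetical inputs chosen to exercise each branch.

```python
import pandas as pd


def validate_data(data):
    # Same checks as the Execution Sample.
    if data.isnull().sum().sum() > 0:
        return False
    if (data["age"] < 0).any():
        return False
    return True


# Hypothetical inputs, one per branch of the function.
clean = pd.DataFrame({"age": [25, 40, 61]})
negative_age = pd.DataFrame({"age": [25, -3, 61]})
missing_age = pd.DataFrame({"age": [25, None, 61]})

print(validate_data(clean))         # True: both conditions are False
print(validate_data(negative_age))  # False: fails the negative age check
print(validate_data(missing_age))   # False: fails the missing value check
```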
Visual Quiz - 3 Questions
Test your understanding
According to the Process Table, what is the result of the missing values check at step 1?
A. True
B. None
C. False
D. Error
💡 Hint
Refer to Process Table row 1: the 'Result' column shows False for the missing values check.
At which step does the pipeline decide to stop?
A. Step 3
B. Step 1
C. Step 2
D. Pipeline never stops
💡 Hint
Look at Process Table row 3, where the validation result is False and the action is to stop the pipeline.
If the data had no negative ages, how would the 'result' variable change?
A. It would be False
B. It would be True
C. It would be None
D. It would cause an error
💡 Hint
Check the Status Tracker row for 'result': if all conditions are False, validation returns True.
Concept Snapshot
Data validation in CI pipeline:
- Triggered by code or data update
- Runs checks like missing values and invalid data
- If validation passes, pipeline continues
- If validation fails, pipeline stops and alerts
- Prevents bad data/model deployment
Full Transcript
When new data or code is committed, the CI pipeline starts. It runs data validation tests step-by-step. First, it checks for missing values. If none are found, it checks for invalid data like negative ages. If any check fails, the pipeline stops and alerts the team to fix issues. If all checks pass, the pipeline continues to deploy the model or data safely. This process helps keep data quality high and prevents errors in production.
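The stop-at-first-failure behaviour described above can be sketched as a small gate function. This is a minimal illustration, not a real CI integration: `run_gate`, the check names, and the `ages` list are all hypothetical, and the "alert" is just a returned message.

```python
from typing import Callable


def run_gate(checks: list[tuple[str, Callable[[], bool]]]) -> tuple[bool, str]:
    """Run named checks in order and stop at the first failure.

    Each check is a zero-argument callable that returns True when it passes.
    Returns (passed, message).
    """
    for name, check in checks:
        if not check():
            return False, f"ALERT: '{name}' failed; pipeline stopped"
    return True, "All checks passed; continuing to deployment"


# Hypothetical values standing in for newly committed data.
ages = [25, -3, 40]
checks = [
    ("no missing values", lambda: all(a is not None for a in ages)),
    ("no negative ages", lambda: all(a >= 0 for a in ages if a is not None)),
]
passed, message = run_gate(checks)
print(passed, message)
```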

Practice

1. What is the main purpose of adding data validation in a CI pipeline for machine learning projects?
easy
A. To speed up the model training process
B. To catch data problems early before training models
C. To reduce the size of the dataset
D. To automatically deploy models to production

Solution

  1. Step 1: Understand the role of CI pipelines

    CI pipelines automate checks and tests to ensure quality before further steps.
  2. Step 2: Identify the purpose of data validation

    Data validation ensures data quality and format correctness to avoid errors in training.
  3. Final Answer:

    To catch data problems early before training models -> Option B
  4. Quick Check:

    Data validation = catch problems early [OK]
Hint: Data validation stops bad data early in pipeline [OK]
Common Mistakes:
  • Thinking validation speeds training
  • Confusing validation with deployment
  • Assuming validation reduces data size
2. Which of the following is the correct way to fail a CI pipeline step if a data validation script returns a non-zero exit code in a bash script?
easy
A. python validate_data.py || exit 1
B. python validate_data.py && exit 1
C. python validate_data.py; exit 0
D. python validate_data.py | exit 1

Solution

  1. Step 1: Understand bash exit codes and operators

    The '||' operator runs the command after it if the first command fails (non-zero exit).
  2. Step 2: Apply to data validation script

    If 'validate_data.py' fails, 'exit 1' stops the pipeline with error.
  3. Final Answer:

    python validate_data.py || exit 1 -> Option A
  4. Quick Check:

    Fail on error = '|| exit 1' [OK]
Hint: Use '|| exit 1' to fail on script error [OK]
Common Mistakes:
  • Using '&&' instead of '||' to fail
  • Using pipe '|' incorrectly
  • Exiting with 0 always
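The same exit-code rule applies when a Python wrapper, rather than bash, launches the validation script. The sketch below uses `subprocess.run` and reads `returncode`; the helper `run_step` is a hypothetical name, and two one-line Python programs stand in for a real `validate_data.py`.

```python
import subprocess
import sys


def run_step(command: list[str]) -> int:
    """Run a command and return its exit code (0 means success)."""
    completed = subprocess.run(command)
    return completed.returncode


# Stand-ins for a validation script: tiny programs that exit with 0 or 1.
passing = [sys.executable, "-c", "import sys; sys.exit(0)"]
failing = [sys.executable, "-c", "import sys; sys.exit(1)"]

print(run_step(passing))  # 0
print(run_step(failing))  # 1

# A wrapper would propagate the failure to the CI system, for example:
# if run_step(failing) != 0: sys.exit(1)
```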
3. Given this Python snippet in a CI pipeline step:
import sys

def validate(data):
    if not data or len(data) < 5:
        return False
    return True

if __name__ == '__main__':
    data = sys.argv[1] if len(sys.argv) > 1 else ''
    if validate(data):
        print('Validation passed')
        sys.exit(0)
    else:
        print('Validation failed')
        sys.exit(1)
What will be the output and exit code if the pipeline runs python validate.py "abc"?
medium
A. Validation failed and exit code 0
B. Validation passed and exit code 0
C. Validation passed and exit code 1
D. Validation failed and exit code 1

Solution

  1. Step 1: Check input data length

    The input is 'abc', whose length is 3. That is less than 5, so validate returns False.
  2. Step 2: Determine output and exit code

    Since validate returns False, it prints 'Validation failed' and exits with code 1.
  3. Final Answer:

    Validation failed and exit code 1 -> Option D
  4. Quick Check:

    Short data fails validation = D [OK]
Hint: Check input length to predict validation result [OK]
Common Mistakes:
  • Assuming any input passes
  • Confusing exit codes 0 and 1
  • Ignoring input length check
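To see where the boundary lies, the `validate` function from the question can be called directly with a few inputs. The inputs below are chosen for illustration; note that a string of exactly five characters passes, because the condition is `len(data) < 5`.

```python
def validate(data):
    # Same logic as the snippet in the question.
    if not data or len(data) < 5:
        return False
    return True


print(validate(""))       # False: empty input
print(validate("abc"))    # False: length 3 is less than 5
print(validate("abcde"))  # True: length 5 is not less than 5
```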
4. You have this YAML snippet in a CI pipeline to run data validation:
steps:
  - name: Validate Data
    run: |
      python validate.py data.csv
      echo "Data validation complete"
The pipeline does not fail even when validate.py returns exit code 1. What is the likely problem?
medium
A. The shell does not stop on errors by default; need 'set -e'
B. The 'if' condition is incorrect and never triggers
C. The 'exit 1' is inside the if but the script continues after
D. The validate.py script always returns 0

Solution

  1. Step 1: Understand shell error handling

    By default, shell scripts continue even if a command fails unless 'set -e' is used.
  2. Step 2: Apply to CI step

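    Without 'set -e', the script continues after python fails and runs the echo, which succeeds, so the step's exit code is 0. Some CI runners already invoke the shell with '-e' (GitHub Actions does so for its default bash shell), so check how your runner starts the shell.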
  3. Final Answer:

    The shell does not stop on errors by default; need 'set -e' -> Option A
  4. Quick Check:

    Use 'set -e' to fail pipeline on errors [OK]
Hint: Add 'set -e' to stop on errors in shell scripts [OK]
Common Mistakes:
  • Assuming exit 1 always stops pipeline
  • Misreading if condition syntax
  • Ignoring shell default behavior
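The effect of 'set -e' can be observed by running the same two-line script with and without it. The sketch below does this from Python with `subprocess.run` and assumes a POSIX `sh` is available on the machine; `false` stands in for a validation command that exits non-zero, and `shell_exit_code` is a hypothetical helper name.

```python
import subprocess


def shell_exit_code(script: str) -> int:
    """Run a script with sh and return the exit code of the whole script."""
    return subprocess.run(["sh", "-c", script], capture_output=True).returncode


# 'false' stands in for a validation command that exits non-zero.
without_e = "false\necho 'Data validation complete'"
with_e = "set -e\nfalse\necho 'Data validation complete'"

print(shell_exit_code(without_e))  # 0: echo ran last and succeeded
print(shell_exit_code(with_e))     # 1: the script stopped at 'false'
```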
5. You want to add a data validation step in your CI pipeline that checks if a CSV file has no missing values and all numeric columns are within a specific range. Which approach best fits this requirement?
hard
A. Use a shell script with grep to search for empty fields and numeric ranges
B. Manually inspect the CSV file before running the pipeline
C. Write a Python script using pandas to check missing values and ranges, then fail with exit code 1 if invalid
D. Skip validation and rely on model training to catch errors

Solution

  1. Step 1: Identify tools for data validation

    Pandas in Python is ideal for checking missing values and numeric ranges programmatically.
  2. Step 2: Implement validation and fail pipeline

    Script should exit with code 1 if validation fails to stop the pipeline safely.
  3. Final Answer:

    Write a Python script using pandas to check missing values and ranges, then fail with exit code 1 if invalid -> Option C
  4. Quick Check:

    Use pandas script + exit 1 for robust validation [OK]
Hint: Use pandas for detailed CSV validation and fail on error [OK]
Common Mistakes:
  • Using grep which can't handle numeric ranges well
  • Relying on manual checks
  • Skipping validation entirely
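A minimal sketch of the approach in Option C might look like the following. The column names, the allowed ranges in `RANGES`, and the function names are assumptions for illustration, and an in-memory CSV is used in place of a real file so the example runs on its own.

```python
import io
import sys

import pandas as pd

# Assumed allowed range per numeric column (hypothetical values).
RANGES = {"age": (0, 120), "income": (0, 1_000_000)}


def validate_csv(source) -> list[str]:
    """Return a list of problems found in the CSV; empty means valid."""
    data = pd.read_csv(source)
    problems = []
    if data.isnull().values.any():
        problems.append("missing values present")
    for column, (low, high) in RANGES.items():
        if column not in data.columns:
            problems.append(f"column '{column}' not found")
        elif not data[column].dropna().between(low, high).all():
            problems.append(f"column '{column}' outside [{low}, {high}]")
    return problems


def main(source) -> int:
    """Return the exit code a CI step would use: 0 if valid, 1 otherwise."""
    problems = validate_csv(source)
    for problem in problems:
        print(f"Validation failed: {problem}", file=sys.stderr)
    return 1 if problems else 0


# In-memory CSV used here instead of a real file path.
bad_csv = io.StringIO("age,income\n34,52000\n-5,61000\n")
exit_code = main(bad_csv)
print(exit_code)
# In a CI script, the last line would instead be: sys.exit(main("data.csv"))
```

Returning the exit code from `main` rather than calling `sys.exit` inside it keeps the function easy to test; only the final line of a real script needs to exit.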