Data validation in CI pipeline in MLOps - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is the purpose of data validation in a CI pipeline?
Data validation in a CI pipeline ensures that the data used for machine learning is clean, correct, and meets quality standards before training or deployment.
beginner
Name a common tool or library used for data validation in ML pipelines.
Great Expectations is a popular open-source tool used to create, manage, and run data validation tests in ML pipelines.
beginner
What happens if data validation fails during a CI pipeline run?
If data validation fails, the CI pipeline stops the process to prevent bad data from moving forward, protecting model quality and system reliability.
intermediate
Why is automating data validation important in CI pipelines?
Automation saves time, reduces human error, and ensures consistent checks every time new data is added or changed.
beginner
Give an example of a simple data validation check in a CI pipeline.
Checking if a dataset has any missing values or if all required columns exist before training starts.
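A minimal sketch of such a check, using only Python's standard csv module. The column names in REQUIRED_COLUMNS and the function name check_csv are hypothetical, chosen for illustration; a real pipeline would use its own schema.

```python
import csv
import io

# Hypothetical schema for illustration only.
REQUIRED_COLUMNS = {"id", "age", "label"}

def check_csv(text_stream, required_columns):
    """Return a list of problems in a CSV stream; an empty list means it passed."""
    reader = csv.DictReader(text_stream)
    header = set(reader.fieldnames or [])
    # Report every required column that the header lacks.
    problems = [f"missing column: {name}" for name in sorted(required_columns - header)]
    # Report empty cells in the required columns that are present.
    for row_number, row in enumerate(reader, start=1):
        for name in sorted(required_columns & header):
            value = row.get(name)
            if value is None or value.strip() == "":
                problems.append(f"row {row_number}: empty value in '{name}'")
    return problems

sample = io.StringIO("id,age\n1,\n")
for problem in check_csv(sample, REQUIRED_COLUMNS):
    print(problem)
```

In a CI step, the calling script would exit with a non-zero code whenever the returned list is not empty.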
What is the first step in data validation within a CI pipeline?
A. Check data quality and schema
B. Train the model
C. Deploy the model
D. Run unit tests on code
Which tool is commonly used for data validation in ML pipelines?
A. Docker
B. Great Expectations
C. Kubernetes
D. Jenkins
What should happen if data validation fails in a CI pipeline?
A. Pipeline stops and reports error
B. Pipeline continues to deploy model
C. Pipeline ignores the error
D. Pipeline deletes the data
Why automate data validation in CI pipelines?
A. To slow down deployment
B. To make data dirty
C. To skip testing
D. To save time and avoid mistakes
Which of these is a simple data validation check?
A. Train a deep learning model
B. Deploy to production
C. Check for missing values
D. Write documentation
Explain why data validation is critical in a CI pipeline for machine learning projects.
Think about what happens if bad data reaches the model.
Describe how you would implement a simple data validation step in a CI pipeline.
Focus on checks before model training.

      Practice

      1. What is the main purpose of adding data validation in a CI pipeline for machine learning projects?
      easy
      A. To speed up the model training process
      B. To catch data problems early before training models
      C. To reduce the size of the dataset
      D. To automatically deploy models to production

      Solution

      1. Step 1: Understand the role of CI pipelines

        CI pipelines automate checks and tests to ensure quality before further steps.
      2. Step 2: Identify the purpose of data validation

        Data validation ensures data quality and format correctness to avoid errors in training.
      3. Final Answer:

        To catch data problems early before training models -> Option B
      4. Quick Check:

        Data validation = catch problems early [OK]
      Hint: Data validation stops bad data early in pipeline [OK]
      Common Mistakes:
      • Thinking validation speeds training
      • Confusing validation with deployment
      • Assuming validation reduces data size
      2. Which bash line makes a CI pipeline step fail when, and only when, the data validation script returns a non-zero exit code?
      easy
      A. python validate_data.py || exit 1
      B. python validate_data.py && exit 1
      C. python validate_data.py; exit 0
      D. python validate_data.py | exit 1

      Solution

      1. Step 1: Understand bash exit codes and operators

        The '||' operator runs the command on its right only if the command on its left fails (returns a non-zero exit code).
      2. Step 2: Apply to data validation script

        If 'validate_data.py' fails, 'exit 1' stops the pipeline with error.
      3. Final Answer:

        python validate_data.py || exit 1 -> Option A
      4. Quick Check:

        Fail on error = '|| exit 1' [OK]
      Hint: Use '|| exit 1' to fail on script error [OK]
      Common Mistakes:
      • Using '&&' instead of '||' to fail
      • Using pipe '|' incorrectly
      • Exiting with 0 always
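To see why only option A tracks the script's result, the sketch below runs each option through a shell from Python and records the exit code. It rests on assumptions: `true` and `false` stand in for a passing and a failing `python validate_data.py`, a POSIX `sh` is available on the PATH, and the helper names are made up for this example.

```python
import subprocess

def shell_exit_code(line):
    """Run one line in a POSIX shell and return the shell's exit code."""
    return subprocess.run(["sh", "-c", line]).returncode

# `true` / `false` are stand-ins for a passing / failing validation script.
OPTIONS = {
    "A": "{cmd} || exit 1",
    "B": "{cmd} && exit 1",
    "C": "{cmd}; exit 0",
    "D": "{cmd} | exit 1",
}

def compare_options():
    """Map each option to (exit code when validation passes, exit code when it fails)."""
    return {
        name: (
            shell_exit_code(template.format(cmd="true")),
            shell_exit_code(template.format(cmd="false")),
        )
        for name, template in OPTIONS.items()
    }

for name, (on_pass, on_fail) in compare_options().items():
    print(f"Option {name}: validation passes -> {on_pass}, validation fails -> {on_fail}")
```

Under these assumptions, option A gives 0 on a pass and 1 on a failure. Options B and D give 1 in both cases, and option C gives 0 in both, so none of them reports the script's actual result.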
      3. Given this Python snippet in a CI pipeline step:
      import sys
      
      def validate(data):
          if not data or len(data) < 5:
              return False
          return True
      
      if __name__ == '__main__':
          data = sys.argv[1] if len(sys.argv) > 1 else ''
          if validate(data):
              print('Validation passed')
              sys.exit(0)
          else:
              print('Validation failed')
              sys.exit(1)
      What will be the output and exit code if the pipeline runs python validate.py "abc"?
      medium
      A. Validation failed and exit code 0
      B. Validation passed and exit code 0
      C. Validation passed and exit code 1
      D. Validation failed and exit code 1

      Solution

      1. Step 1: Check input data length

        The input 'abc' has length 3, which is less than 5, so validate returns False.
      2. Step 2: Determine output and exit code

        Since validate returns False, it prints 'Validation failed' and exits with code 1.
      3. Final Answer:

        Validation failed and exit code 1 -> Option D
      4. Quick Check:

        Short data fails validation = D [OK]
      Hint: Check input length to predict validation result [OK]
      Common Mistakes:
      • Assuming any input passes
      • Confusing exit codes 0 and 1
      • Ignoring input length check
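The boundary of the length rule can be checked directly. This sketch repeats the validate function from the question and adds a hypothetical helper, exit_code_for, that mirrors the exit code the script would report.

```python
def validate(data):
    # Same rule as the question's snippet: empty input or fewer than 5 characters fails.
    if not data or len(data) < 5:
        return False
    return True

def exit_code_for(data):
    """Exit code the script would report for this input: 0 passes, 1 fails."""
    return 0 if validate(data) else 1

for sample in ["", "abc", "abcd", "abcde"]:
    print(repr(sample), "->", exit_code_for(sample))
```

Only the five-character input passes; the empty string and the shorter strings all map to exit code 1.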
      4. You have this YAML snippet in a CI pipeline to run data validation:
      steps:
        - name: Validate Data
          run: |
            python validate.py data.csv
            echo "Data validation complete"
      
      The pipeline does not fail even when validate.py returns exit code 1. What is the likely problem?
      medium
      A. The shell does not stop on errors by default; need 'set -e'
      B. The 'if' condition is incorrect and never triggers
      C. The 'exit 1' is inside the if but the script continues after
      D. The validate.py script always returns 0

      Solution

      1. Step 1: Understand shell error handling

        In a plain shell script, execution continues after a failing command unless 'set -e' is set. Some CI runners enable this option for you, so check how yours invokes the shell.
      2. Step 2: Apply to CI step

        Without 'set -e', the script continues after the python command fails and runs the echo, which succeeds, so the step's exit code is 0.
      3. Final Answer:

        The shell does not stop on errors by default; need 'set -e' -> Option A
      4. Quick Check:

        Use 'set -e' to fail pipeline on errors [OK]
      Hint: Add 'set -e' to stop on errors in shell scripts [OK]
      Common Mistakes:
      • Assuming exit 1 always stops pipeline
      • Misreading if condition syntax
      • Ignoring shell default behavior
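The effect of 'set -e' can be reproduced outside a CI system. The sketch below is an illustration under assumptions: `false` stands in for `python validate.py data.csv` returning exit code 1, a POSIX `sh` is available on the PATH, and run_script is a helper name invented for this example.

```python
import subprocess

# `false` stands in for `python validate.py data.csv` returning exit code 1.
STEP = 'false\necho "Data validation complete"\n'

def run_script(script):
    """Run a multi-line script in a POSIX shell; return (exit code, stdout)."""
    done = subprocess.run(["sh", "-c", script], capture_output=True, text=True)
    return done.returncode, done.stdout

# Without set -e the echo still runs, and its success becomes the step's exit code.
without_e = run_script(STEP)
# With set -e the shell stops at the first failing command.
with_e = run_script("set -e\n" + STEP)

print("without set -e:", without_e)
print("with set -e:   ", with_e)
```

Without 'set -e' the script reports exit code 0 and prints the completion message; with it, the script stops at the failing command and reports exit code 1.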
      5. You want to add a data validation step in your CI pipeline that checks if a CSV file has no missing values and all numeric columns are within a specific range. Which approach best fits this requirement?
      hard
      A. Use a shell script with grep to search for empty fields and numeric ranges
      B. Manually inspect the CSV file before running the pipeline
      C. Write a Python script using pandas to check missing values and ranges, then fail with exit code 1 if invalid
      D. Skip validation and rely on model training to catch errors

      Solution

      1. Step 1: Identify tools for data validation

        The pandas library for Python is well suited to checking missing values and numeric ranges programmatically.
      2. Step 2: Implement validation and fail pipeline

        Script should exit with code 1 if validation fails to stop the pipeline safely.
      3. Final Answer:

        Write a Python script using pandas to check missing values and ranges, then fail with exit code 1 if invalid -> Option C
      4. Quick Check:

        Use pandas script + exit 1 for robust validation [OK]
      Hint: Use pandas for detailed CSV validation and fail on error [OK]
      Common Mistakes:
      • Using grep, which cannot compare numeric ranges reliably
      • Relying on manual checks
      • Skipping validation entirely
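One way option C might look is sketched below. It assumes pandas is installed and that the listed columns are numeric; the column names and limits in RANGES, and the function names, are hypothetical and would come from your own dataset.

```python
import io
import pandas as pd

# Hypothetical limits for illustration; real bounds depend on the dataset.
RANGES = {"age": (0, 120), "income": (0, 1_000_000)}

def find_problems(df, ranges):
    """Return a list of readable problems; an empty list means the data is valid."""
    problems = []
    # Count missing values per column and report the columns that have any.
    missing = df.isna().sum()
    for column, count in missing[missing > 0].items():
        problems.append(f"{column}: {count} missing value(s)")
    # Check each listed numeric column against its inclusive range.
    for column, (low, high) in ranges.items():
        if column not in df.columns:
            problems.append(f"{column}: column not found")
            continue
        out_of_range = ~df[column].dropna().between(low, high)
        if out_of_range.any():
            problems.append(f"{column}: {int(out_of_range.sum())} value(s) outside [{low}, {high}]")
    return problems

def main(path):
    """Validate the CSV at `path`; return 1 if any problem was found, else 0."""
    problems = find_problems(pd.read_csv(path), RANGES)
    for problem in problems:
        print(problem)
    return 1 if problems else 0

# Small in-memory example: one missing income and one age above the limit.
demo = pd.read_csv(io.StringIO("age,income\n30,\n150,72000\n"))
print(find_problems(demo, RANGES))
```

A CI step would call `sys.exit(main(sys.argv[1]))` from the script's entry point so that a non-empty problem list fails the step with exit code 1.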