Training data pipeline automation in MLOps - Commands & Configuration

Introduction

Training data pipeline automation helps you prepare and move data automatically for machine learning models. It saves time and avoids mistakes by running data tasks without manual work.

Use it in situations like these:

When you need to clean and transform raw data every day before training a model.

When you want to fetch new data from a database or API regularly for training.

When you want to split data into training and testing sets automatically.

When you want to track data versions and changes during model development.

When you want to run data preparation steps as part of a machine learning workflow.

Config File - train_data_pipeline.py

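import mlflow
import pandas as pd

def load_data():
    # Read the raw CSV; the path is resolved against the current working directory.
    return pd.read_csv('data/raw_data.csv')

def clean_data(df):
    # Drop every row that contains at least one missing value.
    return df.dropna()

def split_data(df):
    # Sample 80% of the rows for training (fixed seed for reproducibility).
    # The remaining rows form the test set; this assumes unique index labels.
    train = df.sample(frac=0.8, random_state=42)
    test = df.drop(train.index)
    return train, test

def main():
    # The context manager ends the run even if a step raises an exception.
    with mlflow.start_run():
        data = load_data()
        mlflow.log_param('raw_data_rows', len(data))
        clean = clean_data(data)
        mlflow.log_param('clean_data_rows', len(clean))
        train, test = split_data(clean)
        # Write both splits to disk, then attach them to the run as artifacts.
        train.to_csv('data/train.csv', index=False)
        test.to_csv('data/test.csv', index=False)
        mlflow.log_artifact('data/train.csv')
        mlflow.log_artifact('data/test.csv')

if __name__ == '__main__':
    main()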

This Python script automates the training data pipeline with MLflow and pandas.

load_data(): Reads raw CSV data.
clean_data(): Removes rows with missing values.
split_data(): Splits data into training and testing sets.
main(): Runs the pipeline, logs parameters and artifacts with MLflow for tracking.
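
To see what the cleaning and splitting steps described above do to a dataset, here is a minimal sketch that runs them on a small in-memory table. The toy DataFrame, its column names, and the `frac` and `seed` parameters are assumptions made for illustration; they are not part of the original script, and MLflow logging is left out so the example stays small.

```python
import pandas as pd

def clean_data(df):
    # Drop every row that contains at least one missing value.
    return df.dropna()

def split_data(df, frac=0.8, seed=42):
    # Sample a fraction of rows for training; the rest become the test set.
    train = df.sample(frac=frac, random_state=seed)
    test = df.drop(train.index)
    return train, test

# Hypothetical toy data: 12 rows, 2 of which have a missing feature value.
raw = pd.DataFrame({
    "feature": [1.0, 2.0, None, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, None, 12.0],
    "label": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

clean = clean_data(raw)
train, test = split_data(clean)

# The two splits should never share a row and should cover the cleaned data.
overlap = set(train.index) & set(test.index)
print(len(raw), len(clean), len(train), len(test), len(overlap))
```

Because `split_data` removes the training rows by index label, it relies on the index being unique, which holds for a freshly read CSV with the default index.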

Commands

Run the training data pipeline script to load, clean, split data, and log results with MLflow.

Terminal

python train_data_pipeline.py

Expected Output

2024/06/01 12:00:00 INFO mlflow.tracking.fluent: Experiment with name 'Default' does not exist. Creating a new experiment.
2024/06/01 12:00:00 INFO mlflow.tracking.fluent: Run with ID '1234567890abcdef' started.
2024/06/01 12:00:01 INFO mlflow.tracking.fluent: Run with ID '1234567890abcdef' ended.

The exact log lines vary with the MLflow version and tracking configuration.

Start the MLflow tracking UI to view logged parameters and artifacts from the pipeline run.

Terminal

mlflow ui

Expected Output

2024/06/01 12:00:05 INFO mlflow.server: Starting MLflow server at http://127.0.0.1:5000

Option: --port 5000 - Sets the port on which the MLflow UI is served (5000 is the default).

Key Concept

If you remember nothing else from this pattern, remember: automating data preparation with scripts and tracking tools saves time and ensures consistent training data.

Common Mistakes

Not logging data artifacts or parameters in MLflow.

Without logging, you lose track of data versions and pipeline runs, making debugging and comparison hard.

Always use mlflow.log_param and mlflow.log_artifact to record important data and metadata.
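
One way to make a logged data version easy to compare across runs is to record a content hash of each data file alongside the artifact. The sketch below uses only the standard library; the helper name `file_sha256`, the parameter name `train_sha256`, and the sample CSV content are assumptions for illustration, not part of the original pipeline.

```python
import hashlib
import tempfile
from pathlib import Path

def file_sha256(path, chunk_size=1 << 20):
    # Hash the file in chunks so large datasets are not loaded into memory at once.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "train.csv"

    # Hypothetical first version of the training file.
    csv_path.write_bytes(b"feature,label\n1.0,0\n2.0,1\n")
    fingerprint = file_sha256(csv_path)
    # Inside the pipeline this value could be recorded with
    # mlflow.log_param("train_sha256", fingerprint).

    # Appending a row changes the content, so the fingerprint changes too.
    csv_path.write_bytes(b"feature,label\n1.0,0\n2.0,1\n3.0,0\n")
    changed = file_sha256(csv_path)

print(fingerprint != changed)
```

Two runs that log the same hash trained on byte-identical data, which narrows down debugging when results differ.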

Hardcoding file paths instead of making them configurable.

A path such as 'data/raw_data.csv' is resolved against the current working directory, so the script fails when it is run from a different directory or environment.

Build paths from the script's own location, or read the data location from an environment variable.
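
A minimal sketch of a configurable data location follows. The environment variable name `PIPELINE_DATA_DIR`, the helper `resolve_data_dir`, and the example directories are hypothetical; in a real script the base directory would typically be `Path(__file__).resolve().parent`.

```python
import os
from pathlib import Path

def resolve_data_dir(base_dir, env=None):
    # Prefer an explicit override from the environment (hypothetical variable name);
    # otherwise fall back to a "data" folder next to the given base directory.
    env = os.environ if env is None else env
    override = env.get("PIPELINE_DATA_DIR")
    if override:
        return Path(override).expanduser()
    return Path(base_dir) / "data"

# Without an override, paths are anchored to the base directory,
# not to wherever the script happens to be launched from.
default_dir = resolve_data_dir("/srv/app", env={})
raw_path = default_dir / "raw_data.csv"

# With an override, the same code reads from a different location.
custom_dir = resolve_data_dir("/srv/app", env={"PIPELINE_DATA_DIR": "/mnt/datasets"})

print(raw_path)
print(custom_dir)
```

Passing the environment as an argument keeps the helper easy to test without changing real environment variables.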

Summary

Create a Python script to automate loading, cleaning, and splitting training data.

Use MLflow to log parameters and data artifacts for tracking pipeline runs.

Run the script and use the MLflow UI to review data pipeline results and versions.

Practice

1. What is the main benefit of automating a training data pipeline in machine learning?

easy

A. It saves time and reduces human errors during data preparation.

B. It makes the model training faster by using GPUs.

C. It increases the size of the training dataset automatically.

D. It guarantees 100% accuracy of the machine learning model.

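Solution

Step 1: Understand the purpose of automation in data pipelines. An automated pipeline runs data preparation tasks without manual work.

Step 2: Identify the key benefits of automation. It saves time and avoids the mistakes that manual steps introduce. It does not speed up training on GPUs, enlarge the dataset, or guarantee model accuracy.

Final Answer: A. It saves time and reduces human errors during data preparation.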