
Training data pipeline automation in MLOps - Commands & Configuration

Introduction
Training data pipeline automation helps you prepare and move data automatically for machine learning models. It saves time and avoids mistakes by running data tasks without manual work.
  • When you need to clean and transform raw data before training a model every day.
  • When you want to fetch new data from a database or API regularly for training.
  • When you want to split data into training and testing sets automatically.
  • When you want to track data versions and changes during model development.
  • When you want to run data preparation steps as part of a machine learning workflow.
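The common thread in these use cases is chaining data-preparation steps so each stage's output feeds the next. A minimal sketch of that idea, using toy stand-in steps (these helper names are illustrative, not part of the script below):

```python
def run_pipeline(data, steps):
    """Apply each step in order, passing the result along."""
    for step in steps:
        data = step(data)
    return data

# Toy stand-ins for real load/clean/transform stages.
def strip_blanks(rows):
    return [r for r in rows if r]

def uppercase(rows):
    return [r.upper() for r in rows]

result = run_pipeline(["a", "", "b"], [strip_blanks, uppercase])
print(result)  # ['A', 'B']
```

A real pipeline runner (Airflow, Prefect, or a plain script) follows the same shape: an ordered list of stages with the data threaded through.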
Pipeline Script - train_data_pipeline.py
train_data_pipeline.py
import mlflow
import pandas as pd

def load_data():
    # Read the raw CSV into a DataFrame.
    data = pd.read_csv('data/raw_data.csv')
    return data

def clean_data(df):
    # Drop rows with any missing values.
    df_clean = df.dropna()
    return df_clean

def split_data(df):
    # 80/20 train/test split; random_state makes it reproducible.
    train = df.sample(frac=0.8, random_state=42)
    test = df.drop(train.index)
    return train, test

def main():
    mlflow.start_run()
    data = load_data()
    mlflow.log_param('raw_data_rows', len(data))
    clean = clean_data(data)
    mlflow.log_param('clean_data_rows', len(clean))
    train, test = split_data(clean)
    train.to_csv('data/train.csv', index=False)
    test.to_csv('data/test.csv', index=False)
    # Store the split files as run artifacts so each run's data is traceable.
    mlflow.log_artifact('data/train.csv')
    mlflow.log_artifact('data/test.csv')
    mlflow.end_run()

if __name__ == '__main__':
    main()

This Python script automates the training data pipeline using MLflow and pandas.

  • load_data(): Reads raw CSV data.
  • clean_data(): Removes rows with missing values.
  • split_data(): Splits data into training and testing sets.
  • main(): Runs the pipeline, logs parameters and artifacts with MLflow for tracking.
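The 80/20 split logic in split_data() can be checked in isolation with a toy DataFrame (assumes pandas is installed; no CSV files or MLflow needed):

```python
import pandas as pd

# Toy DataFrame standing in for the cleaned data.
df = pd.DataFrame({"x": range(100)})

# Same split logic as split_data() above.
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

print(len(train), len(test))  # 80 20
```

Because random_state is fixed, the same rows land in the training set on every run, which keeps repeated pipeline runs comparable.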
Commands
Run the training data pipeline script to load, clean, split data, and log results with MLflow.
Terminal
python train_data_pipeline.py
Expected Output
2024/06/01 12:00:00 INFO mlflow.tracking.fluent: Experiment with name 'Default' does not exist. Creating a new experiment.
2024/06/01 12:00:00 INFO mlflow.tracking.fluent: Run with ID '1234567890abcdef' started.
2024/06/01 12:00:01 INFO mlflow.tracking.fluent: Run with ID '1234567890abcdef' ended.
Start the MLflow tracking UI to view logged parameters and artifacts from the pipeline run.
Terminal
mlflow ui
Expected Output
2024/06/01 12:00:05 INFO mlflow.server: Starting MLflow server at http://127.0.0.1:5000
--port 5000 - Specify the port where the MLflow UI will be accessible
Key Concept

If you remember nothing else from this pattern, remember: automating data preparation with scripts and tracking tools saves time and ensures consistent training data.

Common Mistakes
Mistake: Not logging data artifacts or parameters in MLflow.
Why it matters: Without logging, you lose track of data versions and pipeline runs, making debugging and comparison hard.
Fix: Always use mlflow.log_param and mlflow.log_artifact to record important data and metadata.

Mistake: Hardcoding file paths without using relative or configurable paths.
Why it matters: The script fails when run from a different directory or environment.
Fix: Use relative paths or environment variables to specify data locations.
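One way to make paths configurable is an environment-variable override with a relative default. A sketch (DATA_DIR is a hypothetical variable name, not something the script above reads):

```python
import os
from pathlib import Path

# Resolve the data directory from an (assumed) DATA_DIR environment
# variable, falling back to a relative "data" directory.
DATA_DIR = Path(os.environ.get("DATA_DIR", "data"))

raw_path = DATA_DIR / "raw_data.csv"
train_path = DATA_DIR / "train.csv"
test_path = DATA_DIR / "test.csv"

print(raw_path)
```

With this pattern, `DATA_DIR=/mnt/shared python train_data_pipeline.py` redirects all reads and writes without touching the code.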
Summary
  • Create a Python script to automate loading, cleaning, and splitting training data.
  • Use MLflow to log parameters and data artifacts for tracking pipeline runs.
  • Run the script and use the MLflow UI to review data pipeline results and versions.