Training data pipeline automation in MLOps - Commands & Configuration

Introduction
Training data pipeline automation helps you prepare and move data automatically for machine learning models. It saves time and avoids mistakes by running data tasks without manual work.
Use it in situations like these:
  • You need to clean and transform raw data every day before training a model.
  • You want to fetch new data from a database or API on a regular schedule for training.
  • You want to split data into training and testing sets automatically.
  • You want to track data versions and changes during model development.
  • You want to run data preparation steps as part of a machine learning workflow.
Pipeline Script - train_data_pipeline.py
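import mlflow
import pandas as pd

def load_data():
    # Read the raw CSV; the path is resolved against the current working directory.
    return pd.read_csv('data/raw_data.csv')

def clean_data(df):
    # Drop every row that contains at least one missing value.
    return df.dropna()

def split_data(df):
    # Sample 80% of the rows for training (fixed seed for reproducibility).
    # The remaining rows form the test set; this relies on unique index labels.
    train = df.sample(frac=0.8, random_state=42)
    test = df.drop(train.index)
    return train, test

def main():
    # The context manager ends the run even if a step raises an exception.
    with mlflow.start_run():
        data = load_data()
        mlflow.log_param('raw_data_rows', len(data))
        clean = clean_data(data)
        mlflow.log_param('clean_data_rows', len(clean))
        train, test = split_data(clean)
        train.to_csv('data/train.csv', index=False)
        test.to_csv('data/test.csv', index=False)
        # Attach the split files to the run so each run keeps its own data snapshot.
        mlflow.log_artifact('data/train.csv')
        mlflow.log_artifact('data/test.csv')

if __name__ == '__main__':
    main()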

This Python script automates the training data pipeline using mlflow and pandas.

  • load_data(): Reads raw CSV data.
  • clean_data(): Removes rows with missing values.
  • split_data(): Splits data into training and testing sets.
  • main(): Runs the pipeline, logs parameters and artifacts with MLflow for tracking.
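The split in split_data() removes the training rows from the frame by index label, so it assumes those labels are unique. A minimal sketch that checks the two sets are disjoint and together cover every row; the toy DataFrame and its column names are made up for illustration and are not the tutorial's data:

```python
import pandas as pd

def split_data(df):
    # Same logic as the pipeline script: 80% sample, the remainder is the test set.
    train = df.sample(frac=0.8, random_state=42)
    test = df.drop(train.index)
    return train, test

# Hypothetical toy data standing in for data/raw_data.csv.
toy = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})
train, test = split_data(toy)

# No row should appear in both sets, and no row should be lost.
overlap = set(train.index) & set(test.index)
covered = set(train.index) | set(test.index)
print(len(train), len(test), len(overlap), covered == set(toy.index))
# prints: 8 2 0 True
```

If the index may contain duplicate labels (for example after concatenating several files), calling reset_index(drop=True) before the split is one way to keep this check valid.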
Commands
Run the training data pipeline script to load, clean, split data, and log results with MLflow.
Terminal
python train_data_pipeline.py
Expected Output (timestamps and run ID will differ)
2024/06/01 12:00:00 INFO mlflow.tracking.fluent: Experiment with name 'Default' does not exist. Creating a new experiment.
2024/06/01 12:00:00 INFO mlflow.tracking.fluent: Run with ID '1234567890abcdef' started.
2024/06/01 12:00:01 INFO mlflow.tracking.fluent: Run with ID '1234567890abcdef' ended.
Start the MLflow tracking UI to view logged parameters and artifacts from the pipeline run.
Terminal
mlflow ui
Expected Output
2024/06/01 12:00:05 INFO mlflow.server: Starting MLflow server at http://127.0.0.1:5000
--port 5000 - Optional flag that sets the port where the MLflow UI is served (5000 is the default)
Key Concept

If you remember nothing else from this pattern, remember: automating data preparation with scripts and tracking tools saves time and ensures consistent training data.

Common Mistakes
Not logging data artifacts or parameters in MLflow.
Without logging, you lose track of data versions and pipeline runs, making debugging and comparison hard.
Always use mlflow.log_param and mlflow.log_artifact to record important data and metadata.
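Hardcoding file paths that only work from one directory or machine.
Relative paths such as 'data/raw_data.csv' are resolved against the current working directory, so the script fails when it is launched from somewhere else or deployed to another environment.
Build paths from the script's own location, or read the data directory from an environment variable or config file.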
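One way to make the data location configurable is sketched below. It assumes an environment variable named PIPELINE_DATA_DIR and a helper called resolve_data_dir; both names are hypothetical and chosen only for this example. The script path passed in at the bottom is also a placeholder.

```python
import os
from pathlib import Path

def resolve_data_dir(script_file, env_var="PIPELINE_DATA_DIR"):
    # PIPELINE_DATA_DIR is a hypothetical variable name used for this sketch.
    override = os.environ.get(env_var)
    if override:
        return Path(override).expanduser().resolve()
    # Fall back to a "data" folder next to the script, independent of the
    # directory the script was launched from.
    return Path(script_file).resolve().parent / "data"

# Inside the real script you would pass __file__; a placeholder path is used here.
data_dir = resolve_data_dir("/opt/pipeline/train_data_pipeline.py")
raw_path = data_dir / "raw_data.csv"
print(raw_path)
```

With this in place, load_data() could read from raw_path instead of a path that depends on the current working directory.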
Summary
Create a Python script to automate loading, cleaning, and splitting training data.
Use MLflow to log parameters and data artifacts for tracking pipeline runs.
Run the script and use the MLflow UI to review data pipeline results and versions.

Practice

1. What is the main benefit of automating a training data pipeline in machine learning?
easy
A. It saves time and reduces human errors during data preparation.
B. It makes the model training faster by using GPUs.
C. It increases the size of the training dataset automatically.
D. It guarantees 100% accuracy of the machine learning model.

Solution

  1. Step 1: Understand the purpose of automation in data pipelines

    Automation helps by handling repetitive tasks consistently without manual intervention.
  2. Step 2: Identify the key benefits of automation

    Automation saves time and reduces errors that happen when humans prepare data manually.
  3. Final Answer:

    It saves time and reduces human errors during data preparation. -> Option A
  4. Quick Check:

    Automation = saves time and reduces errors [OK]
Hint: Automation mainly saves time and avoids mistakes [OK]
Common Mistakes:
  • Thinking automation speeds up model training directly
  • Assuming automation increases dataset size automatically
  • Believing automation guarantees perfect model accuracy
2. Which of the following is the correct Python syntax to define a simple function that automates a data cleaning step?
easy
A. clean_data(data) => data.dropna()
B. def clean_data(data):\n return data.dropna()
C. def clean_data(data):\nreturn data.dropna()
D. function clean_data(data) { return data.dropna() }

Solution

  1. Step 1: Identify correct Python function syntax

    Python functions start with 'def', followed by name and parameters, then indented body.
  2. Step 2: Check indentation and syntax correctness

    def clean_data(data):\n return data.dropna() uses correct indentation and syntax; others use wrong language syntax or missing indentation.
  3. Final Answer:

    def clean_data(data):\n return data.dropna() -> Option B
  4. Quick Check:

    Python function syntax = def + indent + return [OK]
Hint: Python functions need 'def' and proper indentation [OK]
Common Mistakes:
  • Using JavaScript syntax in Python
  • Missing indentation after function definition
  • Using arrow functions which are not Python syntax
3. Consider this Python code snippet automating a data pipeline step:
def normalize(data):
    mean = data.mean()
    std = data.std()
    return (data - mean) / std

import pandas as pd
sample = pd.Series([10, 20, 30])
result = normalize(sample)
print(result.round(2))

What values does the printed Series contain?
medium
A. [ -1.0, 0.0, 1.0 ]
B. [ -1.22, 0.00, 1.22 ]
C. [ 10, 20, 30 ]
D. [ 0.0, 0.0, 0.0 ]

Solution

  1. Step 1: Calculate mean and standard deviation of the sample

    Mean = (10+20+30)/3 = 20; Std deviation = 10 (pandas std() uses ddof=1 by default).
  2. Step 2: Normalize each value and round to 2 decimals

    (10-20)/10 = -1.0, (20-20)/10=0.0, (30-20)/10 = 1.0
  3. Final Answer:

    [ -1.0, 0.0, 1.0 ] -> Option A
  4. Quick Check:

    Normalization = (value-mean)/std [OK]
Hint: Normalize by subtracting mean and dividing by std [OK]
Common Mistakes:
  • Confusing standard deviation with variance
  • Not rounding output
  • Returning original data instead of normalized
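Option B in question 3 is what you get if the population standard deviation is used instead of the sample one. The sketch below reproduces both results with Python's standard statistics module, on the assumption that statistics.stdev stands in for the pandas default (ddof=1) and statistics.pstdev for ddof=0:

```python
import statistics

values = [10, 20, 30]
mean = statistics.mean(values)

# Sample standard deviation (divides by n - 1); this corresponds to the
# default ddof=1 of pandas Series.std().
sample_std = statistics.stdev(values)
# Population standard deviation (divides by n); this corresponds to ddof=0.
population_std = statistics.pstdev(values)

sample_z = [round((v - mean) / sample_std, 2) for v in values]
population_z = [round((v - mean) / population_std, 2) for v in values]

print(sample_z)      # [-1.0, 0.0, 1.0]
print(population_z)  # [-1.22, 0.0, 1.22]
```

The difference only matters for small samples, but it is worth fixing one convention across the pipeline so that training and inference normalize data the same way.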
4. You have this code snippet for automating data loading:
def load_data(file_path):
    data = pd.read_csv(file_path)
    return data

# Usage
dataset = load_data('data.csv')
print(dataset.head())

Running it raises NameError: name 'pd' is not defined. How do you fix it?
medium
A. Remove the function and read CSV directly.
B. Change 'pd.read_csv' to 'csv.read'.
C. Add 'import pandas as pd' at the top of the script.
D. Rename 'file_path' to 'filepath' in the function.

Solution

  1. Step 1: Understand the error message

    NameError means 'pd' is not recognized because pandas was not imported.
  2. Step 2: Fix by importing pandas with alias 'pd'

    Add 'import pandas as pd' at the top so 'pd.read_csv' works correctly.
  3. Final Answer:

    Add 'import pandas as pd' at the top of the script. -> Option C
  4. Quick Check:

    Import pandas as pd to use pd.read_csv [OK]
Hint: Always import pandas as pd before using pd functions [OK]
Common Mistakes:
  • Changing function parameter names without reason
  • Assuming csv module replaces pandas read_csv
  • Removing function instead of fixing import
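For reference, here is a minimal runnable version of the fix from question 4. To keep the sketch self-contained it reads from an in-memory buffer rather than a real data.csv; the CSV contents are made up for illustration.

```python
import io

import pandas as pd  # the missing import that caused the NameError

def load_data(file_path):
    # pd.read_csv accepts a path or any file-like object.
    return pd.read_csv(file_path)

# Hypothetical stand-in for the contents of 'data.csv'.
csv_text = io.StringIO("id,score\n1,0.5\n2,0.9\n")
dataset = load_data(csv_text)
print(dataset.shape)  # (2, 2)
```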
5. You want to automate a training data pipeline that:
1. Loads CSV data,
2. Cleans missing values,
3. Normalizes numeric columns,
4. Saves the processed data.

Which tool or approach best supports scheduling and monitoring this pipeline automatically?
hard
A. Using Excel macros to clean and normalize data.
B. Writing a single Python script and running it manually each time.
C. Training the model directly without data preprocessing.
D. Using Apache Airflow to create and schedule pipeline tasks.

Solution

  1. Step 1: Identify requirements for automation and monitoring

    We need a tool that schedules tasks and tracks their success or failure.
  2. Step 2: Evaluate options for pipeline automation

    Apache Airflow is designed for scheduling, monitoring, and managing workflows automatically.
  3. Final Answer:

    Using Apache Airflow to create and schedule pipeline tasks. -> Option D
  4. Quick Check:

    Airflow = scheduling + monitoring pipelines [OK]
Hint: Use Airflow for automated scheduling and monitoring [OK]
Common Mistakes:
  • Running scripts manually instead of automating
  • Using Excel which lacks automation for pipelines
  • Skipping data preprocessing before training
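The four steps in question 5 map naturally to small functions that a scheduler could run as separate tasks. The following is a plain Python and pandas sketch, not Airflow code: the function names, file names, and sample values are all hypothetical, and in an orchestrator each function would typically be wrapped in its own task.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_csv(path):
    return pd.read_csv(path)

def clean_missing(df):
    # Drop rows that contain any missing value.
    return df.dropna()

def normalize_numeric(df):
    # Z-score each numeric column; non-numeric columns are left unchanged.
    # A constant column has std 0 and would need separate handling.
    out = df.copy()
    cols = out.select_dtypes(include="number").columns
    out[cols] = (out[cols] - out[cols].mean()) / out[cols].std()
    return out

def save_csv(df, path):
    df.to_csv(path, index=False)
    return path

def run_pipeline(raw_path, out_path):
    # Each call below is a step a scheduler could run and monitor on its own.
    df = load_csv(raw_path)
    df = clean_missing(df)
    df = normalize_numeric(df)
    return save_csv(df, out_path)

# Demo on a temporary file with made-up data (one row has a missing value).
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw.csv"
    raw.write_text("height,city\n10,Oslo\n20,Rome\n,Lima\n30,Kyiv\n")
    out = run_pipeline(raw, Path(tmp) / "processed.csv")
    processed = pd.read_csv(out)

print(processed["height"].round(2).tolist())  # [-1.0, 0.0, 1.0]
```

Keeping each step as its own function is what lets a tool like Airflow retry, time, and report on the steps individually instead of treating the whole script as one opaque job.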