MLOps · devops · ~30 mins

Training data pipeline automation in MLOps - Mini Project: Build & Apply

Training Data Pipeline Automation

📖 Scenario: You are working as a machine learning engineer. You need to automate the process of preparing training data for your ML model. This involves collecting raw data, filtering it based on quality, and then outputting the cleaned data ready for training.

🎯 Goal: Build a simple Python script that automates a training data pipeline. The script will start with raw data, apply a quality filter, and then output the cleaned data.

📋 What You'll Learn

Create a dictionary with raw data samples and their quality scores

Add a quality threshold variable to filter data

Use a dictionary comprehension to select only data samples at or above the threshold

Print the filtered data dictionary

💡 Why This Matters

🌍 Real World

Automating data preparation saves time and reduces errors in machine learning projects by ensuring only good quality data is used for training.

💼 Career

Data engineers and ML engineers often build automated pipelines like this to prepare data efficiently and reliably for model training.

Create raw data dictionary

Create a dictionary called raw_data with these exact entries: 'sample1': 0.85, 'sample2': 0.45, 'sample3': 0.95, 'sample4': 0.30, 'sample5': 0.75 representing data sample names and their quality scores.

# Create the raw_data dictionary with sample names and quality scores
# Your code here

Hint

Use curly braces to create a dictionary. Each entry has a sample name as a string key and a float value for quality.
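As an illustration of the syntax this hint describes, here is a minimal sketch using made-up entries. The names example_scores, clip_a, and clip_b are hypothetical and are not part of the exercise.

```python
# Hypothetical entries, used only to show the syntax; not the exercise data.
example_scores = {'clip_a': 0.62, 'clip_b': 0.91}

# Look up one quality score by its string key.
print(example_scores['clip_b'])  # 0.91

# Count how many samples the dictionary holds.
print(len(example_scores))  # 2
```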

Set quality threshold

Create a variable called quality_threshold and set it to 0.7 to filter out low-quality data samples.

raw_data = {'sample1': 0.85, 'sample2': 0.45, 'sample3': 0.95, 'sample4': 0.30, 'sample5': 0.75}
# Create the quality_threshold variable and set it to 0.7
# Your code here

Hint

Just assign the number 0.7 to the variable named quality_threshold.
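Keeping the cutoff in a named variable means it can be changed in one place later. The sketch below shows the idea with a hypothetical variable min_score and made-up scores; it is not the exercise solution.

```python
# Hypothetical cutoff for illustration; the exercise defines its own threshold.
min_score = 0.5

# Made-up score compared against the named cutoff.
candidate_score = 0.62
print(candidate_score >= min_score)  # True

# With >=, a score exactly equal to the cutoff is also kept.
print(0.5 >= min_score)  # True

# A lower score is rejected.
print(0.3 >= min_score)  # False
```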

Filter data using dictionary comprehension

Create a new dictionary called filtered_data using a dictionary comprehension. Include only the entries from raw_data whose quality score is greater than or equal to quality_threshold. Use sample and score as the loop variables.

raw_data = {'sample1': 0.85, 'sample2': 0.45, 'sample3': 0.95, 'sample4': 0.30, 'sample5': 0.75}
quality_threshold = 0.7
# Create filtered_data dictionary with samples having score >= quality_threshold
# Your code here

Hint

Use dictionary comprehension syntax: {key: value for key, value in some_dict.items() if condition}, where some_dict is the dictionary you are filtering.
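To see the pattern in action on unrelated data, here is a minimal sketch. The dictionary word_lengths and the cutoff of 4 are made up for illustration. The loop version is included only to show what the comprehension is shorthand for.

```python
# Hypothetical data: each word mapped to its length.
word_lengths = {'data': 4, 'pipeline': 8, 'ml': 2}

# Keep only the entries whose value is at least 4.
long_words = {word: length for word, length in word_lengths.items() if length >= 4}
print(long_words)  # {'data': 4, 'pipeline': 8}

# The same filter written as an explicit loop.
long_words_loop = {}
for word, length in word_lengths.items():
    if length >= 4:
        long_words_loop[word] = length

print(long_words == long_words_loop)  # True
```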

Print filtered data

Call print() to display the filtered_data dictionary.

raw_data = {'sample1': 0.85, 'sample2': 0.45, 'sample3': 0.95, 'sample4': 0.30, 'sample5': 0.75}
quality_threshold = 0.7
filtered_data = {sample: score for sample, score in raw_data.items() if score >= quality_threshold}
# Print the filtered_data dictionary
# Your code here

Hint

Use print(filtered_data) to show the filtered dictionary.
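Once the four steps work, one possible next move is to wrap the filter in a function so the same logic can run on any batch of samples. The sketch below assumes a hypothetical helper name, filter_by_quality, which is not part of the exercise; the data and threshold are the ones used in the steps above.

```python
def filter_by_quality(samples, threshold):
    """Return a new dict holding only the samples whose score is >= threshold."""
    return {sample: score for sample, score in samples.items() if score >= threshold}


# Exercise data and threshold from the steps above.
raw_data = {'sample1': 0.85, 'sample2': 0.45, 'sample3': 0.95, 'sample4': 0.30, 'sample5': 0.75}
quality_threshold = 0.7

filtered_data = filter_by_quality(raw_data, quality_threshold)
print(filtered_data)
# {'sample1': 0.85, 'sample3': 0.95, 'sample5': 0.75}
```

The function returns a new dictionary and leaves raw_data unchanged, so the original scores remain available if the threshold is adjusted later.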

Practice

1. What is the main benefit of automating a training data pipeline in machine learning?

easy

A. It saves time and reduces human errors during data preparation.

B. It makes the model training faster by using GPUs.

C. It increases the size of the training dataset automatically.

D. It guarantees 100% accuracy of the machine learning model.

Solution

Step 1: Understand the purpose of automation in data pipelines

Step 2: Identify the key benefits of automation

Final Answer: A. It saves time and reduces human errors during data preparation.
