Practice

1. What is the main benefit of automating a training data pipeline in machine learning?

easy

A. It saves time and reduces human errors during data preparation.

B. It makes the model training faster by using GPUs.

C. It increases the size of the training dataset automatically.

D. It guarantees 100% accuracy of the machine learning model.

Solution

Step 1: Understand the purpose of automation in data pipelines
Automation helps by handling repetitive tasks consistently without manual intervention.
Step 2: Identify the key benefits of automation
Automation saves time and reduces errors that happen when humans prepare data manually.
Final Answer:
It saves time and reduces human errors during data preparation. -> Option A
Quick Check:
Automation = saves time and reduces errors [OK]

Hint: Automation mainly saves time and avoids mistakes [OK]

Common Mistakes:

Thinking automation speeds up model training directly
Assuming automation increases dataset size automatically
Believing automation guarantees perfect model accuracy

2. Which of the following is the correct Python syntax to define a simple function that automates a data cleaning step?

easy

A. clean_data(data) => data.dropna()

B. def clean_data(data):\n return data.dropna()

C. def clean_data(data):\nreturn data.dropna()

D. function clean_data(data) { return data.dropna() }

Solution

Step 1: Identify correct Python function syntax
Python functions start with 'def', followed by the function name, the parameters in parentheses, a colon, and an indented body.
Step 2: Check indentation and syntax correctness
def clean_data(data):\n return data.dropna() uses correct syntax and indents the body; the other options either use syntax from another language (A, D) or omit the required indentation (C).
Final Answer:
def clean_data(data):\n return data.dropna() -> Option B
Quick Check:
Python function syntax = def + indent + return [OK]

Hint: Python functions need 'def' and proper indentation [OK]

Common Mistakes:

Using JavaScript syntax in Python
Missing indentation after function definition
Using arrow functions which are not Python syntax
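
The four options can also be checked mechanically. The sketch below is only an illustration: the helper name is_valid_python is hypothetical, and each "\n" from the options is written as a real newline. It asks Python's own parser whether each option is valid source code.

```python
import ast

# The four answer options, with "\n" written as a real newline.
options = {
    "A": "clean_data(data) => data.dropna()",
    "B": "def clean_data(data):\n return data.dropna()",
    "C": "def clean_data(data):\nreturn data.dropna()",
    "D": "function clean_data(data) { return data.dropna() }",
}


def is_valid_python(source):
    """Return True if the source parses as Python, False otherwise."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


results = {label: is_valid_python(src) for label, src in options.items()}
print(results)  # {'A': False, 'B': True, 'C': False, 'D': False}
```

Parsing only checks syntax; it does not run the function or require pandas, so it says nothing about whether data.dropna() would succeed on real data.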

3. Consider this Python code snippet automating a data pipeline step:

def normalize(data):
    mean = data.mean()
    std = data.std()
    return (data - mean) / std

import pandas as pd
sample = pd.Series([10, 20, 30])
result = normalize(sample)
print(result.round(2))

Which values does it print (ignoring the Series index and dtype line)?

medium

A. [ -1.0, 0.0, 1.0 ]

B. [ -1.22, 0.00, 1.22 ]

C. [ 10, 20, 30 ]

D. [ 0.0, 0.0, 0.0 ]

Solution

Step 1: Calculate mean and standard deviation of the sample
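Mean = (10+20+30)/3 = 20. pandas std() uses ddof=1 by default, so Std deviation = sqrt(((-10)^2 + 0^2 + 10^2) / (3-1)) = sqrt(100) = 10.
Step 2: Normalize each value and round to 2 decimals
(10-20)/10 = -1.0, (20-20)/10 = 0.0, (30-20)/10 = 1.0. The values -1.22 and 1.22 in Option B would result from dividing by the population standard deviation (ddof=0, about 8.16) instead.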
Final Answer:
[ -1.0, 0.0, 1.0 ] -> Option A
Quick Check:
Normalization = (value-mean)/std [OK]

Hint: Normalize by subtracting mean and dividing by std [OK]

Common Mistakes:

Confusing standard deviation with variance
Not rounding output
Returning original data instead of normalized
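
The arithmetic above can be reproduced without pandas. This is a minimal sketch that uses the standard library's statistics module as a stand-in: statistics.stdev divides by n - 1 (the same convention as pandas' default ddof=1), while statistics.pstdev divides by n (ddof=0). The variable names are chosen for illustration only.

```python
import statistics

sample = [10, 20, 30]
mean = statistics.mean(sample)

# Sample standard deviation (divides by n - 1), matching pandas' default ddof=1.
sample_std = statistics.stdev(sample)
# Population standard deviation (divides by n), matching ddof=0.
population_std = statistics.pstdev(sample)

z_sample = [round((x - mean) / sample_std, 2) for x in sample]
z_population = [round((x - mean) / population_std, 2) for x in sample]

print(z_sample)      # [-1.0, 0.0, 1.0]   -> Option A
print(z_population)  # [-1.22, 0.0, 1.22] -> Option B
```

The only difference between the two results is the divisor used for the standard deviation, which is why Option B is a plausible but incorrect answer for the pandas snippet.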

4. You have this code snippet for automating data loading:

def load_data(file_path):
    data = pd.read_csv(file_path)
    return data

# Usage
dataset = load_data('data.csv')
print(dataset.head())

But it throws an error: NameError: name 'pd' is not defined. How do you fix it?

medium

A. Remove the function and read CSV directly.

B. Change 'pd.read_csv' to 'csv.read'.

C. Add 'import pandas as pd' at the top of the script.

D. Rename 'file_path' to 'filepath' in the function.

Solution

Step 1: Understand the error message
NameError means 'pd' is not recognized because pandas was not imported.
Step 2: Fix by importing pandas with alias 'pd'
Add 'import pandas as pd' at the top so 'pd.read_csv' works correctly.
Final Answer:
Add 'import pandas as pd' at the top of the script. -> Option C
Quick Check:
Import pandas as pd to use pd.read_csv [OK]

Hint: Always import pandas as pd before using pd functions [OK]

Common Mistakes:

Changing function parameter names without reason
Assuming csv module replaces pandas read_csv
Removing function instead of fixing import
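
For reference, a corrected version of the snippet might look like the sketch below. It assumes pandas is installed, and it reads from an in-memory io.StringIO buffer with made-up column names instead of 'data.csv', purely so the example runs without a file on disk.

```python
import io

import pandas as pd  # the missing import that caused the NameError


def load_data(file_path):
    """Read a CSV file (or file-like object) into a DataFrame."""
    return pd.read_csv(file_path)


# Stand-in for 'data.csv' (hypothetical contents) so the sketch is self-contained.
csv_text = io.StringIO("feature,label\n1.5,0\n2.5,1\n")
dataset = load_data(csv_text)
print(dataset.head())
```

The function body is unchanged from the question; only the import line was added.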

5. You want to automate a training data pipeline that:
1. Loads CSV data,
2. Cleans missing values,
3. Normalizes numeric columns,
4. Saves the processed data.

Which tool or approach best supports scheduling and monitoring this pipeline automatically?

hard

A. Using Excel macros to clean and normalize data.

B. Writing a single Python script and running it manually each time.

C. Training the model directly without data preprocessing.

D. Using Apache Airflow to create and schedule pipeline tasks.

Solution

Step 1: Identify requirements for automation and monitoring
We need a tool that schedules tasks and tracks their success or failure.
Step 2: Evaluate options for pipeline automation
Apache Airflow is designed for scheduling, monitoring, and managing workflows automatically.
Final Answer:
Using Apache Airflow to create and schedule pipeline tasks. -> Option D
Quick Check:
Airflow = scheduling + monitoring pipelines [OK]

Hint: Use Airflow for automated scheduling and monitoring [OK]

Common Mistakes:

Running scripts manually instead of automating
Using Excel, which lacks scheduling and monitoring for pipelines
Skipping data preprocessing before training
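
The four pipeline steps from the question can be sketched as plain Python functions. This is an illustration under stated assumptions, not an Airflow DAG: the function names (load, clean, normalize, save) and the sample data are hypothetical, pandas is assumed to be installed, and in-memory buffers stand in for CSV files. An orchestrator such as Apache Airflow could run each function as a separate scheduled task; that wiring is deliberately left out here.

```python
import io

import pandas as pd


def load(source):
    """Step 1: load CSV data."""
    return pd.read_csv(source)


def clean(df):
    """Step 2: drop rows that contain missing values."""
    return df.dropna()


def normalize(df):
    """Step 3: standardize numeric columns (pandas std() uses ddof=1)."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = (out[numeric_cols] - out[numeric_cols].mean()) / out[numeric_cols].std()
    return out


def save(df, target):
    """Step 4: write the processed data as CSV."""
    df.to_csv(target, index=False)


# Hypothetical input: row "c" has a missing value and is dropped by clean().
raw = io.StringIO("name,value\na,10\nb,20\nc,\nd,30\n")
processed = io.StringIO()

result = normalize(clean(load(raw)))
save(result, processed)
print(processed.getvalue())
```

Keeping each step as its own function is what makes the pipeline easy to hand to a scheduler: every step can be retried, logged, and monitored independently.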
