Pipeline components and DAGs in MLOps - Mini Project: Build & Apply

Building a Simple MLOps Pipeline with Components and DAGs
📖 Scenario: You are a data engineer on a team that builds machine learning pipelines. Your task is to create a simple pipeline with components for data loading, data preprocessing, and model training. These components are connected in a Directed Acyclic Graph (DAG) that defines the order of execution. This project will help you understand how pipeline components and DAGs work in MLOps.
🎯 Goal: Build a simple MLOps pipeline using Python dictionaries to represent components and a list to represent the DAG order. You will create components for data loading, preprocessing, and training, then connect them in a DAG, and finally print the execution order.
📋 What You'll Learn
Create a dictionary called components with keys 'load_data', 'preprocess_data', and 'train_model' each having a string description as value.
Create a list called dag that defines the execution order of the components as 'load_data', 'preprocess_data', 'train_model'.
Use a for loop to iterate over the dag list and print the component name and its description from the components dictionary.
💡 Why This Matters
🌍 Real World
In real MLOps, pipelines are built with components representing tasks like data loading, preprocessing, and training. These tasks are connected in a DAG to control the order of execution.
💼 Career
Understanding pipeline components and DAGs is essential for roles like MLOps engineer, data engineer, and machine learning engineer to automate and manage ML workflows efficiently.
1. Create pipeline components dictionary
Create a dictionary called components with these exact entries: 'load_data': 'Load raw data from source', 'preprocess_data': 'Clean and transform data', and 'train_model': 'Train ML model on processed data'.
Hint

Use curly braces {} to create a dictionary. Each key is a string like 'load_data' and each value is a string description.
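
A minimal sketch of this step, using the keys and descriptions given in the task:

```python
# Each key names a pipeline component; each value describes what it does.
components = {
    'load_data': 'Load raw data from source',
    'preprocess_data': 'Clean and transform data',
    'train_model': 'Train ML model on processed data',
}
```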

2. Define the DAG execution order
Create a list called dag with the exact order of component names: 'load_data', 'preprocess_data', 'train_model'.
Hint

Use square brackets [] to create a list. Put the component names as strings in the correct order.
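
A minimal sketch of this step. Note that a plain list can only express a straight chain of tasks, which is enough for this project; a real DAG can also branch and merge.

```python
# The list order is the execution order: each component runs after the one before it.
dag = ['load_data', 'preprocess_data', 'train_model']
```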

3. Iterate over DAG and print component info
Use a for loop with variable component to iterate over the dag list. Inside the loop, get the description from components[component] and print the component name and description in this format: "Component: {component}, Description: {description}".
Hint

Use a for loop to go through each item in dag. Use f-strings to format the print output.

4. Print the pipeline execution order
Write a print statement to display the text exactly: "Pipeline execution order completed."
Hint

Use print("Pipeline execution order completed.") exactly to show the final message.
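
Putting the four steps together, one possible version of the complete script looks like this. The names components, dag, component, and description all come from the steps above; nothing else is assumed.

```python
# Step 1: components and their descriptions.
components = {
    'load_data': 'Load raw data from source',
    'preprocess_data': 'Clean and transform data',
    'train_model': 'Train ML model on processed data',
}

# Step 2: execution order of the components.
dag = ['load_data', 'preprocess_data', 'train_model']

# Step 3: walk the DAG in order and look up each description.
for component in dag:
    description = components[component]
    print(f"Component: {component}, Description: {description}")

# Step 4: final message.
print("Pipeline execution order completed.")
```

Running it prints one "Component: ..., Description: ..." line per entry in dag, in list order, followed by the completion message.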

Practice

1. What does a Directed Acyclic Graph (DAG) represent in an MLOps pipeline?
easy
A. Tasks and their dependencies without any cycles
B. A loop of tasks that repeat indefinitely
C. Random tasks executed in parallel without order
D. Only the final output of a pipeline

Solution

  1. Step 1: Understand DAG structure

    A DAG is a graph with nodes and edges where edges show dependencies and no cycles exist.
  2. Step 2: Relate DAG to pipeline tasks

    In MLOps, tasks are nodes and dependencies are edges, ensuring tasks run in order without loops.
  3. Final Answer:

    Tasks and their dependencies without any cycles -> Option A
  4. Quick Check:

    DAG = tasks + dependencies without loops [OK]
Hint: DAG means no loops, just tasks linked in order [OK]
Common Mistakes:
  • Thinking DAG allows loops
  • Confusing DAG with random task order
  • Assuming DAG only shows final output
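
To make the "no cycles" rule concrete, here is a small sketch that uses Python's standard-library graphlib module (Python 3.9+) instead of an MLOps tool. The dependencies mapping is an illustrative assumption: each key is a task and each value is the set of tasks it depends on.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    'load_data': set(),
    'preprocess_data': {'load_data'},
    'train_model': {'preprocess_data'},
}

# A valid DAG always has at least one order that respects every dependency.
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # ['load_data', 'preprocess_data', 'train_model']

# Making the first task depend on the last one creates a cycle,
# so no valid execution order exists and graphlib raises CycleError.
cyclic = dict(dependencies)
cyclic['load_data'] = {'train_model'}
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)  # True
```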
2. Which of the following is the correct syntax to define a simple DAG in Apache Airflow?
easy
A. dag = DAG('my_dag', interval='daily')
B. dag = DAG('my_dag' schedule='daily')
C. dag = DAG('my_dag', schedule='everyday')
D. dag = DAG('my_dag', schedule_interval='@daily')

Solution

  1. Step 1: Check Airflow DAG syntax

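    The DAG constructor takes the DAG ID as its first argument. In Airflow versions before 2.4 the schedule is set with the schedule_interval parameter; Airflow 2.4 added schedule as the preferred name and deprecated schedule_interval, and Airflow 3.0 removed it.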
  2. Step 2: Validate options

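    dag = DAG('my_dag', schedule_interval='@daily') uses the schedule_interval parameter with the valid preset '@daily'. Option A uses an interval parameter that the constructor does not accept, option B is missing the comma between arguments, and option C uses 'everyday', which is not a valid schedule value.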
  3. Final Answer:

    dag = DAG('my_dag', schedule_interval='@daily') -> Option D
  4. Quick Check:

    Correct DAG syntax uses schedule_interval [OK]
Hint: Use schedule_interval='@daily' for daily DAGs [OK]
Common Mistakes:
  • Mixing up 'schedule' and 'schedule_interval' (which one is accepted depends on the Airflow version)
  • Wrong interval value formats
  • Missing commas between parameters
3. Given this Airflow DAG snippet, what is the order of task execution?
task1 = DummyOperator(task_id='task1', dag=dag)
task2 = DummyOperator(task_id='task2', dag=dag)
task3 = DummyOperator(task_id='task3', dag=dag)
task1 >> task2 >> task3
medium
A. task3, then task2, then task1
B. task1, then task2, then task3
C. task2, then task1, then task3
D. All tasks run in parallel

Solution

  1. Step 1: Analyze task dependencies

    The '>>' operator sets order: task1 before task2, task2 before task3.
  2. Step 2: Determine execution sequence

    Tasks run in sequence: task1 first, then task2, then task3.
  3. Final Answer:

    task1, then task2, then task3 -> Option B
  4. Quick Check:

    task1 >> task2 >> task3 means sequential order [OK]
Hint: >> means run left task before right task [OK]
Common Mistakes:
  • Assuming tasks run in reverse order
  • Thinking tasks run in parallel
  • Ignoring the '>>' operator meaning
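
The chaining behaviour of >> can be imitated in plain Python with operator overloading. The ToyTask class below is a hypothetical stand-in written for illustration, not Airflow's implementation; it only records which task comes after which.

```python
class ToyTask:
    """Hypothetical stand-in for an operator; not an Airflow class."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []  # tasks that must run after this one

    def __rshift__(self, other):
        # `a >> b` records b as downstream of a and returns b,
        # so `a >> b >> c` is evaluated as `(a >> b) >> c`.
        self.downstream.append(other)
        return other


task1 = ToyTask('task1')
task2 = ToyTask('task2')
task3 = ToyTask('task3')
task1 >> task2 >> task3

# Walk the chain from the first task to recover the execution order.
order = []
current = task1
while current is not None:
    order.append(current.task_id)
    current = current.downstream[0] if current.downstream else None
print(order)  # ['task1', 'task2', 'task3']
```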
4. You wrote this DAG code but get an error: TypeError: 'DAG' object is not iterable. What is the likely cause?
with DAG('example_dag', schedule_interval='@daily') as dag:
    task1 = DummyOperator(task_id='task1')
    task2 = DummyOperator(task_id='task2')
    task1 >> task2

for task in dag:
    print(task.task_id)
medium
A. DAG object is not iterable, so 'for task in dag' causes error
B. DummyOperator requires a 'dag' parameter outside the context
C. Missing import for DummyOperator
D. schedule_interval '@daily' is invalid

Solution

  1. Step 1: Identify error cause

    The error says 'DAG' object is not iterable, likely from trying to loop over dag object.
  2. Step 2: Understand DAG iterability

    DAG objects in Airflow are not iterable directly; looping over them causes this error.
  3. Final Answer:

    DAG object is not iterable, so 'for task in dag' causes error -> Option A
  4. Quick Check:

    DAG is not iterable; use dag.tasks list instead [OK]
Hint: DAG is not iterable; use dag.tasks to loop [OK]
Common Mistakes:
  • Trying to loop directly over DAG object
  • Assuming DummyOperator needs dag param outside context
  • Misreading error as import issue
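
The error is ordinary Python behaviour: a for loop needs an object that supports iteration. The ToyDag class below is a hypothetical stand-in, not Airflow's DAG; it keeps its tasks in a tasks list, mirroring the dag.tasks attribute mentioned in the Quick Check.

```python
class ToyDag:
    """Hypothetical container that stores tasks but defines no __iter__."""

    def __init__(self, dag_id):
        self.dag_id = dag_id
        self.tasks = []


dag = ToyDag('example_dag')
dag.tasks.extend(['task1', 'task2'])

# Looping over the object itself fails because it is not iterable.
try:
    for task in dag:
        print(task)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)  # 'ToyDag' object is not iterable

# Looping over the list of tasks works.
task_ids = [task for task in dag.tasks]
print(task_ids)  # ['task1', 'task2']
```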
5. You want to create a pipeline where task A runs first, then tasks B and C run in parallel, and finally task D runs after both B and C finish. Which DAG structure correctly represents this?
hard
A. [A, B] >> C >> D
B. A >> B >> C >> D
C. A >> [B, C] >> D
D. A >> D >> [B, C]

Solution

  1. Step 1: Understand task order requirements

    Task A runs first, then B and C run at the same time, then D runs after both finish.
  2. Step 2: Translate to DAG syntax

    Using Airflow syntax, 'A >> [B, C] >> D' means A before B and C in parallel, then D after both.
  3. Final Answer:

    A >> [B, C] >> D -> Option C
  4. Quick Check:

    Parallel tasks in list brackets between sequential tasks [OK]
Hint: Use brackets [] for parallel tasks in DAG [OK]
Common Mistakes:
  • Placing tasks in wrong order
  • Not using brackets for parallel tasks
  • Assuming linear order for all tasks
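
The fan-out and fan-in shape can also be checked without Airflow. This sketch uses the standard-library graphlib module (Python 3.9+) with the task names A to D from the question; writing the structure as a dependency mapping is an assumption made for illustration.

```python
from graphlib import TopologicalSorter

# A >> [B, C] >> D written as "task: set of tasks it waits for".
dependencies = {
    'A': set(),
    'B': {'A'},
    'C': {'A'},
    'D': {'B', 'C'},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

# Each batch holds the tasks whose dependencies are all finished,
# so tasks in the same batch could run in parallel.
batches = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    batches.append(ready)
    sorter.done(*ready)
print(batches)  # [['A'], ['B', 'C'], ['D']]
```

B and C land in the same batch because both wait only on A, and D appears last because it waits on both of them.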