
Pipeline components and DAGs in MLOps - Step-by-Step Execution

Process Flow - Pipeline components and DAGs
  1. Define pipeline components
  2. Arrange components as tasks
  3. Create DAG with dependencies
  4. Schedule and run pipeline
  5. Monitor task execution and results
  6. Complete pipeline run
This flow shows how pipeline components become tasks arranged in a DAG, which is then scheduled and executed step-by-step.
Execution Sample
task1 = load_data()
task2 = preprocess(task1)
task3 = train_model(task2)
task4 = evaluate(task3)
pipeline = DAG([task1, task2, task3, task4])
pipeline.run()
This code defines four tasks as pipeline components, arranges them in a DAG with dependencies, and runs the pipeline.
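The snippet above is pseudocode: `DAG([...])` and `pipeline.run()` do not belong to a specific library. A minimal runnable sketch of the same idea follows, assuming Python 3.9+ and the standard-library `graphlib` module. The task bodies, the `TASKS` mapping, and the `run_pipeline` helper are hypothetical names introduced for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical task bodies; each returns a string describing its output.
def load_data():
    return "raw data"

def preprocess(raw):
    return f"cleaned({raw})"

def train_model(clean):
    return f"model({clean})"

def evaluate(model):
    return f"metrics({model})"

# Each task name maps to (function, list of upstream task names).
TASKS = {
    "load_data": (load_data, []),
    "preprocess": (preprocess, ["load_data"]),
    "train_model": (train_model, ["preprocess"]),
    "evaluate": (evaluate, ["train_model"]),
}

def run_pipeline(tasks):
    """Run tasks in dependency order; return (execution order, outputs)."""
    # TopologicalSorter expects a mapping of node -> predecessors.
    graph = {name: set(deps) for name, (_, deps) in tasks.items()}
    order = list(TopologicalSorter(graph).static_order())
    outputs = {}
    for name in order:
        func, deps = tasks[name]
        # Pass each upstream output to the task as a positional argument.
        outputs[name] = func(*(outputs[d] for d in deps))
    return order, outputs

order, outputs = run_pipeline(TASKS)
print(order)
print(outputs["evaluate"])
```

Declaring dependencies as data, rather than calling one function inside another, is what lets a scheduler decide the execution order on its own.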
Process Table
| Step | Task | Status Before | Action | Status After | Output |
|---|---|---|---|---|---|
| 1 | load_data | Not started | Start task | Completed | Raw data loaded |
| 2 | preprocess | Waiting for load_data | Start after load_data | Completed | Data cleaned |
| 3 | train_model | Waiting for preprocess | Start after preprocess | Completed | Model trained |
| 4 | evaluate | Waiting for train_model | Start after train_model | Completed | Model evaluated |
| 5 | pipeline | Running | All tasks completed | Completed | Pipeline run successful |
💡 All tasks completed in dependency order, so the pipeline run ends successfully.
Status Tracker
| Variable | Start | After Step 1 | After Step 2 | After Step 3 | After Step 4 | Final |
|---|---|---|---|---|---|---|
| task1_status | Not started | Completed | Completed | Completed | Completed | Completed |
| task2_status | Not started | Not started | Completed | Completed | Completed | Completed |
| task3_status | Not started | Not started | Not started | Completed | Completed | Completed |
| task4_status | Not started | Not started | Not started | Not started | Completed | Completed |
| pipeline_status | Running | Running | Running | Running | Running | Completed |
Key Moments - 3 Insights
Why does 'preprocess' wait before starting?
'preprocess' depends on 'load_data' completing first, as shown in row 2 of the Process Table, where the status before is 'Waiting for load_data'. This ensures the data is ready before cleaning starts.
Can tasks run in any order?
No, tasks run in the order defined by the dependencies in the DAG. For example, 'train_model' waits for 'preprocess' to finish (row 3 of the Process Table). This prevents errors from missing inputs. Only tasks with no dependency between them can run in parallel.
What happens if a task fails?
If a task fails, downstream tasks that depend on it won't start. This is not shown here, but it is a key safety feature of DAG-based schedulers: it avoids running tasks whose inputs are missing.
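The failure behavior described above can be sketched with the standard library. This is an illustration under stated assumptions, not how any particular orchestrator is implemented: `run_with_failures`, `ok`, and `broken` are hypothetical helpers, and the status labels are choices made for this sketch (`upstream_failed` mirrors the name Airflow uses for that state).

```python
from graphlib import TopologicalSorter

def run_with_failures(graph, actions):
    """graph: task -> set of upstream tasks; actions: task -> callable.

    Returns a dict of task -> "success", "failed", or "upstream_failed".
    """
    status = {}
    for name in TopologicalSorter(graph).static_order():
        # Do not start a task unless every upstream task succeeded.
        if any(status[up] != "success" for up in graph[name]):
            status[name] = "upstream_failed"
            continue
        try:
            actions[name]()
            status[name] = "success"
        except Exception:
            status[name] = "failed"
    return status

def ok():
    pass

def broken():
    raise RuntimeError("preprocess crashed")

graph = {
    "load_data": set(),
    "preprocess": {"load_data"},
    "train_model": {"preprocess"},
    "evaluate": {"train_model"},
}
actions = {"load_data": ok, "preprocess": broken, "train_model": ok, "evaluate": ok}

status = run_with_failures(graph, actions)
print(status)
```

Here 'preprocess' fails, so 'train_model' and 'evaluate' are never called, while 'load_data' keeps its successful result.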
Visual Quiz - 3 Questions
Test your understanding
In the Process Table, what is the status of 'train_model' before step 3?
A. Not started
B. Waiting for preprocess
C. Completed
D. Running
💡 Hint
Check the 'Status Before' column for 'train_model' at step 3 in the Process Table.
At which step does the pipeline status change to 'Completed'?
A. Step 4
B. Step 3
C. Step 5
D. Step 2
💡 Hint
Look at the 'pipeline' row in the Process Table and find when the status changes to 'Completed'.
If 'load_data' fails, what happens to 'preprocess'?
A. It does not start
B. It runs anyway
C. It starts immediately
D. It skips and pipeline completes
💡 Hint
Refer to the Key Moments explanation of what happens to downstream tasks when a task fails.
Concept Snapshot
Pipeline components are tasks arranged in a Directed Acyclic Graph (DAG).
Each task can depend on the outputs of upstream tasks.
The DAG ensures tasks run in order respecting dependencies.
The pipeline runs by executing tasks step by step.
Failures stop downstream tasks to keep data safe.
Full Transcript
This visual execution shows how pipeline components become tasks arranged in a DAG. The code example defines four tasks: load_data, preprocess, train_model, and evaluate. Each task waits for its dependencies to complete before starting. The Process Table traces each step, showing task statuses before and after running. The Status Tracker shows task states changing from 'Not started' to 'Completed'. Key moments clarify why tasks wait for others and how the DAG controls execution order. The quiz tests understanding of task statuses and pipeline completion. The snapshot summarizes the core idea: pipelines use DAGs to run tasks in order safely.

Practice

1. What does a Directed Acyclic Graph (DAG) represent in an MLOps pipeline?
easy
A. Tasks and their dependencies without any cycles
B. A loop of tasks that repeat indefinitely
C. Random tasks executed in parallel without order
D. Only the final output of a pipeline

Solution

  1. Step 1: Understand DAG structure

    A DAG is a graph of nodes connected by directed edges, where the edges show dependencies and no cycles exist.
  2. Step 2: Relate DAG to pipeline tasks

    In MLOps, tasks are nodes and dependencies are edges, ensuring tasks run in order without loops.
  3. Final Answer:

    Tasks and their dependencies without any cycles -> Option A
  4. Quick Check:

    DAG = tasks + dependencies without loops [OK]
Hint: DAG means no loops, just tasks linked in order [OK]
Common Mistakes:
  • Thinking DAG allows loops
  • Confusing DAG with random task order
  • Assuming DAG only shows final output
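The "no cycles" requirement can be checked mechanically. A minimal sketch, assuming Python 3.9+ and the standard-library `graphlib` module; `is_valid_dag` and the two example graphs are hypothetical names used only for illustration.

```python
from graphlib import CycleError, TopologicalSorter

def is_valid_dag(graph):
    """Return True if the task graph (task -> upstream tasks) has no cycle."""
    try:
        # prepare() raises CycleError when the graph contains a cycle.
        TopologicalSorter(graph).prepare()
        return True
    except CycleError:
        return False

# train_model depends on preprocess, which depends on load_data.
acyclic = {"preprocess": {"load_data"}, "train_model": {"preprocess"}}

# preprocess and train_model depend on each other: no valid order exists.
cyclic = {"preprocess": {"train_model"}, "train_model": {"preprocess"}}

print(is_valid_dag(acyclic))
print(is_valid_dag(cyclic))
```

A cycle means no task in the loop could ever start, because each one waits for another member of the loop to finish first.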
2. Which of the following is the correct syntax to define a simple DAG in Apache Airflow?
easy
A. dag = DAG('my_dag', interval='daily')
B. dag = DAG('my_dag' schedule='daily')
C. dag = DAG('my_dag', schedule='everyday')
D. dag = DAG('my_dag', schedule_interval='@daily')

Solution

  1. Step 1: Check Airflow DAG syntax

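    The DAG constructor takes a DAG ID as its first argument. In Airflow releases before 2.4, the schedule is set with the schedule_interval keyword; from Airflow 2.4 onward, schedule is the preferred keyword and schedule_interval is deprecated.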
  2. Step 2: Validate options

    dag = DAG('my_dag', schedule_interval='@daily') uses the keyword 'schedule_interval' with the valid preset '@daily'. The other options use an unsupported keyword, omit the comma between arguments, or pass an invalid value.
  3. Final Answer:

    dag = DAG('my_dag', schedule_interval='@daily') -> Option D
  4. Quick Check:

    Correct DAG syntax uses schedule_interval [OK]
Hint: Use schedule_interval='@daily' for daily DAGs [OK]
Common Mistakes:
  • Using an unsupported keyword such as 'interval' ('schedule' is accepted only in Airflow 2.4 and later)
  • Wrong interval value formats
  • Missing commas between parameters
3. Given this Airflow DAG snippet, what is the order of task execution?
task1 = DummyOperator(task_id='task1', dag=dag)
task2 = DummyOperator(task_id='task2', dag=dag)
task3 = DummyOperator(task_id='task3', dag=dag)
task1 >> task2 >> task3
medium
A. task3, then task2, then task1
B. task1, then task2, then task3
C. task2, then task1, then task3
D. All tasks run in parallel

Solution

  1. Step 1: Analyze task dependencies

    The '>>' operator sets order: task1 before task2, task2 before task3.
  2. Step 2: Determine execution sequence

    Tasks run in sequence: task1 first, then task2, then task3.
  3. Final Answer:

    task1, then task2, then task3 -> Option B
  4. Quick Check:

    task1 >> task2 >> task3 means sequential order [OK]
Hint: >> means run left task before right task [OK]
Common Mistakes:
  • Assuming tasks run in reverse order
  • Thinking tasks run in parallel
  • Ignoring the '>>' operator meaning
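To see why chaining with '>>' produces a sequence, here is a toy model of the operator in plain Python. It is a sketch for intuition only, not Airflow's actual implementation; the `Task` class and its `downstream` attribute are hypothetical.

```python
class Task:
    """Toy stand-in for an operator; not Airflow's actual implementation."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # self >> other: record that `other` runs after `self`.
        self.downstream.append(other)
        # Return the right-hand task so chains like a >> b >> c work.
        return other

task1, task2, task3 = Task("task1"), Task("task2"), Task("task3")
task1 >> task2 >> task3

print([t.task_id for t in task1.downstream])
print([t.task_id for t in task2.downstream])
print([t.task_id for t in task3.downstream])
```

Python evaluates `task1 >> task2 >> task3` left to right, so `task1 >> task2` runs first and hands `task2` to the second `>>`.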
4. You wrote this DAG code but get an error: TypeError: 'DAG' object is not iterable. What is the likely cause?
with DAG('example_dag', schedule_interval='@daily') as dag:
    task1 = DummyOperator(task_id='task1')
    task2 = DummyOperator(task_id='task2')
    task1 >> task2

for task in dag:
    print(task.task_id)
medium
A. DAG object is not iterable, so 'for task in dag' causes error
B. DummyOperator requires a 'dag' parameter outside the context
C. Missing import for DummyOperator
D. schedule_interval '@daily' is invalid

Solution

  1. Step 1: Identify error cause

    The error says the 'DAG' object is not iterable, which points to the loop over the dag object.
  2. Step 2: Understand DAG iterability

    DAG objects in Airflow are not iterable directly; looping over them causes this error.
  3. Final Answer:

    DAG object is not iterable, so 'for task in dag' causes error -> Option A
  4. Quick Check:

    DAG is not iterable; use dag.tasks list instead [OK]
Hint: DAG is not iterable; use dag.tasks to loop [OK]
Common Mistakes:
  • Trying to loop directly over DAG object
  • Assuming DummyOperator needs dag param outside context
  • Misreading error as import issue
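The same error can be reproduced without Airflow. The sketch below uses a hypothetical `ToyDAG` class that, like the object in the question, keeps its tasks in a `tasks` list but defines no `__iter__` method.

```python
class ToyDAG:
    """Toy container with a `tasks` list and no __iter__ method."""

    def __init__(self, dag_id):
        self.dag_id = dag_id
        self.tasks = []

dag = ToyDAG("example_dag")
dag.tasks.extend(["task1", "task2"])

# Looping over the object itself fails because it is not iterable.
try:
    for task in dag:
        print(task)
    error = None
except TypeError as exc:
    error = str(exc)
print(error)

# Looping over the list attribute works.
task_ids = [task for task in dag.tasks]
print(task_ids)
```

In the question's code, the corresponding fix is to loop with `for task in dag.tasks:`.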
5. You want to create a pipeline where task A runs first, then tasks B and C run in parallel, and finally task D runs after both B and C finish. Which DAG structure correctly represents this?
hard
A. [A, B] >> C >> D
B. A >> B >> C >> D
C. A >> [B, C] >> D
D. A >> D >> [B, C]

Solution

  1. Step 1: Understand task order requirements

    Task A runs first, then B and C run at the same time, then D runs after both finish.
  2. Step 2: Translate to DAG syntax

    Using Airflow syntax, 'A >> [B, C] >> D' means A before B and C in parallel, then D after both.
  3. Final Answer:

    A >> [B, C] >> D -> Option C
  4. Quick Check:

    Parallel tasks in list brackets between sequential tasks [OK]
Hint: Use brackets [] for parallel tasks in DAG [OK]
Common Mistakes:
  • Placing tasks in wrong order
  • Not using brackets for parallel tasks
  • Assuming linear order for all tasks
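The fan-out and fan-in pattern from question 5 can be sketched in plain Python. The `Task` class below is a toy model that accepts lists on either side of '>>'; it is not Airflow's implementation, and `ready_batches` is a hypothetical helper built on the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

class Task:
    """Toy task supporting `a >> [b, c] >> d`; not Airflow's implementation."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()

    def __rshift__(self, other):
        # Handles `task >> task` and `task >> [task, task]`.
        targets = other if isinstance(other, list) else [other]
        for target in targets:
            target.upstream.add(self.task_id)
        return other

    def __rrshift__(self, other):
        # Handles `[task, task] >> task`: lists define no __rshift__,
        # so Python falls back to the right operand's __rrshift__.
        for source in other:
            self.upstream.add(source.task_id)
        return self

def ready_batches(tasks):
    """Group task ids into batches whose members can run at the same time."""
    sorter = TopologicalSorter({t.task_id: t.upstream for t in tasks})
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches

A, B, C, D = (Task(name) for name in "ABCD")
A >> [B, C] >> D

batches = ready_batches([A, B, C, D])
print(batches)
```

B and C land in the same batch because neither depends on the other, and D appears only after both are done.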