Apache Airflow for ML orchestration
📖 Scenario: You are working as a data engineer in a team that builds machine learning models. You want to automate the steps of your ML workflow using Apache Airflow. This will help your team run the training and evaluation tasks automatically every day without manual work.
🎯 Goal: Build a simple Apache Airflow DAG that orchestrates three ML tasks: data extraction, model training, and model evaluation. You will create the DAG structure, add configuration for scheduling, define the tasks, and finally print the task order to verify the workflow.
📋 What You'll Learn
Create a DAG with the id ml_workflow
Set the DAG schedule interval to run daily at 7 AM
Define three PythonOperator tasks named extract_data, train_model, and evaluate_model
Set task dependencies so that extract_data runs before train_model, and train_model runs before evaluate_model
Print the list of task ids in the order they will run
💡 Why This Matters
🌍 Real World
Automating ML workflows with Apache Airflow helps teams run complex pipelines reliably and on schedule without manual intervention.
💼 Career
Understanding Airflow DAGs and task orchestration is essential for ML engineers and data engineers working in MLOps roles.
1
Create the DAG structure
Import DAG from airflow and datetime from the datetime module. Then create a DAG object called ml_workflow with dag_id='ml_workflow' and start_date=datetime(2024, 1, 1).
Hint
Use DAG(dag_id='ml_workflow', start_date=datetime(2024, 1, 1)) to create the DAG.
2
Add schedule interval configuration
Add the schedule_interval parameter to the ml_workflow DAG and set it to '0 7 * * *' to run daily at 7 AM.
Hint
Set schedule_interval='0 7 * * *' inside the DAG constructor.
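To make the cron string in this step easier to read, the sketch below splits '0 7 * * *' into the five standard cron fields. It uses only the Python standard library; describe_cron is a hypothetical helper written for illustration and is not part of Airflow.

```python
# Standard five-field cron layout, left to right.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression):
    """Split a five-field cron expression into named fields.

    Hypothetical helper for illustration only; Airflow parses cron
    strings itself and does not expose a function with this name.
    """
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


fields = describe_cron("0 7 * * *")
for name, value in fields.items():
    print(f"{name}: {value}")
```

Minute 0 and hour 7, with wildcards in the remaining fields, reads as "at 07:00 every day".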
3
Define ML tasks using PythonOperator
Import PythonOperator from airflow.operators.python. Define three tasks named extract_data, train_model, and evaluate_model using PythonOperator. Each task should have a task_id matching its name and a python_callable that is a simple function printing the task name. Assign all tasks to the dag.
Hint
Define simple functions that print messages, then create PythonOperator tasks with matching task_id and python_callable.
4
Set task dependencies and print task order
Set the task order so that extract_data runs before train_model, and train_model runs before evaluate_model, using the bitshift operator >>. Then print the task ids in the order they will run by calling topological_sort() on the DAG object (ml_workflow.topological_sort()) and printing each task's task_id.
Hint
Use extract_data >> train_model >> evaluate_model to set dependencies. Call topological_sort() on the ml_workflow DAG object to get the tasks in order.
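The ordering that this step prints can be illustrated without Airflow. The sketch below is a stand-in, not Airflow's implementation: it uses graphlib.TopologicalSorter from the Python standard library (Python 3.9 or later) and a hand-written dependency map for the three tasks.

```python
from graphlib import TopologicalSorter

# Map each task id to the set of task ids that must finish before it.
# This dictionary is written by hand for illustration; in Airflow the
# same information comes from the >> dependencies on the DAG.
upstream = {
    "extract_data": set(),
    "train_model": {"extract_data"},
    "evaluate_model": {"train_model"},
}

# static_order() yields every node after all of its predecessors.
order = list(TopologicalSorter(upstream).static_order())
for task_id in order:
    print(task_id)
```

Because the three tasks form a single chain, only one valid order exists: extract_data, train_model, evaluate_model.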
Practice
1. What is the main purpose of Apache Airflow in ML orchestration?
easy
A. To store large datasets for ML training
B. To write ML model code in Python
C. To visualize ML model performance metrics
D. To automate and schedule ML workflows as directed tasks
Solution
Step 1: Understand Airflow's role
Apache Airflow is designed to automate workflows by scheduling and running tasks in order.
Step 2: Differentiate from other ML tools
It does not store data, visualize metrics, or write model code but manages task execution.
Final Answer:
To automate and schedule ML workflows as directed tasks -> Option D
Quick Check:
Airflow = workflow automation
Hint: Airflow schedules tasks, not data or model code
Common Mistakes:
Confusing Airflow with data storage tools
Thinking Airflow writes ML model code
Assuming Airflow visualizes model metrics
2. Which of the following is the correct way to define a DAG in Apache Airflow using Python?
easy
A. dag = DAG('my_dag', run_every='daily')
B. dag = DAG('my_dag', schedule_interval='@daily')
C. dag = DAG('my_dag', interval='daily')
D. dag = DAG('my_dag', schedule='daily')
Solution
Step 1: Recall DAG initialization syntax
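Among these options, only schedule_interval='@daily' is valid. run_every and interval are not DAG parameters. Airflow 2.4 and later also accept a schedule parameter (and deprecate schedule_interval), but option D still fails because 'daily' is neither a preset (presets start with @, as in '@daily') nor a cron expression.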
Step 2: Verify the example
dag = DAG('my_dag', schedule_interval='@daily') is the standard syntax to schedule daily runs.
Final Answer:
dag = DAG('my_dag', schedule_interval='@daily') -> Option B
Quick Check:
Use schedule_interval to set DAG timing
Hint: Presets such as '@daily' start with @
Common Mistakes:
Using incorrect parameter names like run_every
Passing 'daily' instead of the '@daily' preset
Forgetting to use quotes around '@daily'
3. Given the following Airflow DAG snippet, what will be the order of task execution?
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
def task_a():
    print('Task A')
def task_b():
    print('Task B')
def task_c():
    print('Task C')
dag = DAG('example_dag', start_date=datetime(2024, 1, 1), schedule_interval='@once')
t1 = PythonOperator(task_id='a', python_callable=task_a, dag=dag)
t2 = PythonOperator(task_id='b', python_callable=task_b, dag=dag)
t3 = PythonOperator(task_id='c', python_callable=task_c, dag=dag)
t1 >> t2 >> t3
medium
A. Task A, then Task B, then Task C
B. Task C, then Task B, then Task A
C. Task A, Task B, and Task C run in parallel
D. Task B, then Task A, then Task C
Solution
Step 1: Understand task dependencies
The operator chaining t1 >> t2 >> t3 means t1 runs first, then t2, then t3.
Step 2: Confirm execution order
Tasks print in order: Task A, Task B, Task C.
Final Answer:
Task A, then Task B, then Task C -> Option A
Quick Check:
Operator chaining sets task order
Hint: >> means run left task before right task
Common Mistakes:
Assuming tasks run in parallel without dependencies
Misreading the >> operator order
Confusing task IDs with execution order
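To see why t1 >> t2 >> t3 forms a chain, it helps to look at how a >> operator can be defined in Python. The sketch below uses a hypothetical ToyTask class, not Airflow's operator classes: its __rshift__ records the right-hand task as downstream and returns it, so the next >> in the chain applies to that task.

```python
class ToyTask:
    """Toy stand-in for a task; illustration only, not an Airflow class."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # a >> b: record b as downstream of a, then return b so that
        # a >> b >> c is evaluated as (a >> b) >> c, which is b >> c.
        self.downstream.append(other)
        return other


t1, t2, t3 = ToyTask("a"), ToyTask("b"), ToyTask("c")
t1 >> t2 >> t3

chain = [(t.task_id, [d.task_id for d in t.downstream]) for t in (t1, t2, t3)]
print(chain)
```

Task a ends up with b as its only downstream task, b with c, and c with none, which matches the order A, B, C.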
4. You wrote this Airflow DAG code but get the error "TypeError: DAG.__init__() got an unexpected keyword argument 'start'". What is the likely cause?
dag = DAG('my_dag', start='2024-01-01', schedule_interval='@daily')
medium
A. The parameter should be start_date, not start
B. The schedule_interval value '@daily' is invalid
C. DAG name cannot be 'my_dag'
D. Missing import for datetime module
Solution
Step 1: Identify incorrect parameter
The error says start is unexpected; Airflow expects start_date.
Step 2: Confirm correct parameter usage
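Replacing start with start_date fixes the TypeError. Airflow documents start_date as a datetime object, so start_date=datetime(2024, 1, 1) is the safer form than the string '2024-01-01'.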
Final Answer:
The parameter should be start_date, not start -> Option A
Quick Check:
Use start_date, not start
Hint: Use start_date, not start, for DAG start time
Common Mistakes:
Using 'start' instead of 'start_date'
Assuming '@daily' is invalid schedule
Ignoring error message details
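The same kind of TypeError can be reproduced with any Python function that has a fixed set of keyword parameters. The sketch below uses make_dag, a hypothetical stand-in written for this example, not Airflow's DAG class, to show the failing call and the corrected one.

```python
from datetime import datetime


def make_dag(dag_id, start_date=None, schedule_interval=None):
    """Hypothetical stand-in for a constructor with fixed keyword names."""
    return {
        "dag_id": dag_id,
        "start_date": start_date,
        "schedule_interval": schedule_interval,
    }


message = ""
try:
    # 'start' is not one of the accepted keyword names, so Python raises
    # TypeError before the function body runs.
    make_dag("my_dag", start="2024-01-01", schedule_interval="@daily")
except TypeError as exc:
    message = str(exc)
    print(message)

# Corrected call: use the accepted name and pass a datetime object.
fixed = make_dag("my_dag", start_date=datetime(2024, 1, 1), schedule_interval="@daily")
print(fixed["start_date"])
```

The error message names the offending keyword, which is usually enough to locate the misspelled parameter.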
5. You want to create an Airflow DAG that runs an ML training task only if data preprocessing succeeded. Which Airflow feature should you use to enforce this dependency?
hard
A. Schedule both tasks to run at the same time
B. Use Airflow Variables to store task status
C. Set task dependencies using >> operator between preprocessing and training tasks
D. Write a single Python function combining both tasks
Solution
Step 1: Understand task dependency in Airflow
Airflow uses task dependencies to control execution order, ensuring one task runs after another succeeds.
Step 2: Apply dependency operator
Using the >> operator makes the training task downstream of preprocessing. With the default trigger rule (all_success), the training task runs only after preprocessing completes successfully.
Final Answer:
Set task dependencies using >> operator between preprocessing and training tasks -> Option C
Quick Check:
Use >> to enforce task order
Hint: Use >> to link tasks in order
Common Mistakes:
Thinking Variables control task order
Scheduling tasks simultaneously without dependencies
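The "run training only if preprocessing succeeded" behavior can be sketched with a toy runner. This is a simplified illustration under stated assumptions, not Airflow's scheduler: run_pipeline is a hypothetical function that runs tasks in a fixed linear order and marks every task after the first failure as upstream_failed.

```python
def run_pipeline(tasks):
    """Run (task_id, callable) pairs in order and return a state per task.

    Toy model for illustration: once a task fails, every later task is
    marked 'upstream_failed' and its callable is never invoked.
    """
    states = {}
    upstream_ok = True
    for task_id, func in tasks:
        if not upstream_ok:
            states[task_id] = "upstream_failed"
            continue
        try:
            func()
            states[task_id] = "success"
        except Exception:
            states[task_id] = "failed"
            upstream_ok = False
    return states


def preprocess():
    # Simulate a preprocessing failure.
    raise ValueError("bad input data")


def train():
    print("training")


states = run_pipeline([("preprocess", preprocess), ("train", train)])
print(states)
```

In this run the training callable is never executed, because the task it depends on failed first.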