Why DAG design determines pipeline reliability in Apache Airflow - Performance Analysis
We want to see how the structure of a DAG affects how long a pipeline takes to run. Specifically, how do the number of tasks and the dependencies between them determine execution time?
Analyze the time complexity of the following DAG setup.
```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def task_func():
    # Placeholder body; each task does no real work here.
    pass

dag = DAG('example_dag', start_date=datetime(2024, 1, 1))

# Create 10 tasks, then wire them into a single dependency chain
# so each task runs only after the previous one succeeds.
tasks = [
    PythonOperator(task_id=f'task_{i}', python_callable=task_func, dag=dag)
    for i in range(10)
]
for i in range(9):
    tasks[i] >> tasks[i + 1]
```
This DAG links 10 tasks in a single chain, so each task starts only after the previous one finishes.
- Primary operation: executing one task to completion.
- How many times: n times (here, 10), since the tasks run strictly one after another.
Because no task can overlap with another, every task you add contributes its full runtime to the total.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 tasks run in order |
| 100 | 100 tasks run one after another |
| 1000 | 1000 tasks run sequentially |
Pattern observation: The total execution time grows roughly in direct proportion to the number of tasks.
Time Complexity: O(n)
This means total runtime scales linearly with the number of tasks: doubling the task count roughly doubles the pipeline's wall-clock time.
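The linear growth can be checked with a small standalone sketch, no Airflow required. It assumes, purely for illustration, that every task takes the same fixed duration; in a single dependency chain, tasks cannot overlap, so the makespan is just that duration times n.

```python
def chain_makespan(n_tasks: int, seconds_per_task: float = 1.0) -> float:
    """Total wall-clock time for n tasks that must run one after another.

    Assumes a uniform, hypothetical per-task duration: with a single
    chain there is no overlap, so the total is simply n * duration.
    """
    return n_tasks * seconds_per_task

for n in (10, 100, 1000):
    print(n, chain_makespan(n))
```

Ten times the tasks yields ten times the makespan, matching the O(n) pattern in the table above.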
[X] Wrong: "Adding more tasks won't affect total run time much because they run fast."
[OK] Correct: Even if tasks are fast, running them one after another adds up time linearly, so more tasks mean longer total time.
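The arithmetic behind the correction is worth making concrete. Using made-up numbers (0.5 s of work plus 0.1 s of scheduler overhead per task; real values vary by deployment), even "fast" tasks accumulate linearly when chained:

```python
# Hypothetical per-task costs; real figures depend on the deployment.
work_seconds = 0.5      # the task's own runtime
overhead_seconds = 0.1  # scheduling/queueing cost paid per task

def chain_total(n_tasks: int) -> float:
    """Sequential chain: every task pays its work AND its overhead in full."""
    return n_tasks * (work_seconds + overhead_seconds)

print(chain_total(10))    # small pipeline
print(chain_total(1000))  # 100x the tasks -> 100x the wall-clock time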
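The arithmetic behind the correction is worth making concrete. Using made-up numbers (0.5 s of work plus 0.1 s of scheduler overhead per task; real values vary by deployment), even "fast" tasks accumulate linearly when chained:

```python
# Hypothetical per-task costs; real figures depend on the deployment.
work_seconds = 0.5      # the task's own runtime
overhead_seconds = 0.1  # scheduling/queueing cost paid per task

def chain_total(n_tasks: int) -> float:
    """Sequential chain: every task pays its work AND its overhead in full."""
    return n_tasks * (work_seconds + overhead_seconds)

print(chain_total(10))    # small pipeline
print(chain_total(1000))  # 100x the tasks -> 100x the wall-clock time
```

No single task is slow, yet the 1000-task chain takes 100 times longer than the 10-task one.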
Understanding how task ordering drives pipeline runtime helps you design better workflows and reason about real-world pipeline reliability.
"What if we changed the chain to run some tasks in parallel? How would the time complexity change?"
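One hedged way to think about that question: with enough parallel worker slots, the makespan is governed by the DAG's depth (its longest dependency chain), not its task count. The sketch below compares the depth of the chain above with a hypothetical fan-out shape (`start >> [n - 2 parallel tasks] >> end`); the shape names are illustrative, not Airflow API.

```python
def chain_depth(n: int) -> int:
    """task_0 >> task_1 >> ... >> task_{n-1}: depth equals the task count."""
    return n

def fan_out_depth(n: int) -> int:
    """start >> [n - 2 parallel tasks] >> end: depth stays at 3 levels."""
    return 3 if n >= 3 else n

print(chain_depth(1000))    # depth grows with n -> O(n) makespan
print(fan_out_depth(1000))  # constant depth -> O(1) makespan, given enough workers
```

In practice the speedup is capped by Airflow's parallelism settings and pool sizes, so the realistic bound is closer to O(n / number_of_slots) for the fan-out shape.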