DAG performance tracking in Apache Airflow - Time & Space Complexity
When tracking DAG performance in Airflow, we want to understand how the time to process tasks grows as the number of tasks increases.
We ask: How does adding more tasks affect the total time to complete the DAG?
Analyze the time complexity of this DAG task execution loop.
```python
# dag.tasks is the list of tasks in the DAG
# task.run() executes the task logic
# log_performance and update_metrics record timing info
for task in dag.tasks:
    task.run()
    log_performance(task)
    update_metrics(task)
```
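To make the loop's cost concrete, here is a minimal, self-contained simulation. `StubTask`, `StubDag`, and the tracking functions are hypothetical stand-ins invented for illustration; they are not part of the Airflow API.

```python
# Hypothetical stand-ins for Airflow objects, for illustration only.
class StubTask:
    def __init__(self, task_id):
        self.task_id = task_id

    def run(self):
        pass  # real task logic would execute here


def log_performance(task):
    pass  # a real tracker would record timing info here


def update_metrics(task):
    pass  # a real tracker would push metrics here


class StubDag:
    def __init__(self, n_tasks):
        self.tasks = [StubTask(f"task_{i}") for i in range(n_tasks)]


def run_dag(dag):
    """Run every task sequentially; return the count of operations performed."""
    ops = 0
    for task in dag.tasks:
        task.run()
        log_performance(task)
        update_metrics(task)
        ops += 3  # one run + one log + one metrics update per task
    return ops
```

Calling `run_dag(StubDag(10))` performs 30 operations, `run_dag(StubDag(100))` performs 300: the count scales directly with the number of tasks.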
This code runs each task in the DAG one by one and tracks its performance.
Look for repeated actions that affect time.
- Primary operation: Loop over all tasks in the DAG.
- How many times: Once per task, so as many times as there are tasks.
As the number of tasks grows, the total time grows in direct proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 task runs + 10 logs + 10 metric updates |
| 100 | 100 task runs + 100 logs + 100 metric updates |
| 1000 | 1000 task runs + 1000 logs + 1000 metric updates |
Pattern observation: Doubling tasks roughly doubles total work and time.
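The table's pattern can be checked with a tiny counting function (a sketch, not Airflow code): each task contributes exactly one run, one log, and one metrics update.

```python
def total_operations(n_tasks):
    """Operations = n runs + n logs + n metric updates = 3n, which is O(n)."""
    return 3 * n_tasks


# Doubling the task count doubles the total work.
assert total_operations(200) == 2 * total_operations(100)
```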
Time Complexity: O(n)
This means the total time grows linearly with the number of tasks in the DAG.
[X] Wrong: "Adding more tasks won't affect total execution time much because tasks run independently."
[OK] Correct: Even if tasks run independently, tracking and running each task still takes time, so more tasks mean more total work.
Understanding how task count affects DAG run time helps you design efficient workflows and explain performance trade-offs clearly.
"What if we parallelize task runs instead of running them sequentially? How would the time complexity change?"
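One way to explore that question is a worker-pool sketch. The classes below are hypothetical stand-ins (real Airflow executors handle scheduling themselves); the point is that with `p` workers, wall-clock time for independent, similarly sized tasks drops to roughly O(n/p), while the total work performed is still O(n). Dependencies between tasks further limit parallelism to the DAG's critical path.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for illustration only.
class Task:
    def run(self):
        pass  # real task logic would execute here


def log_performance(task):
    pass


def update_metrics(task):
    pass


def run_parallel(tasks, workers=4):
    """Run independent tasks across a worker pool; return how many ran.

    Wall-clock time ~ O(n / workers) for independent tasks,
    but total work (CPU time across all workers) remains O(n).
    """
    def run_and_track(task):
        task.run()
        log_performance(task)
        update_metrics(task)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map is lazy; consume it so every task actually executes
        list(pool.map(run_and_track, tasks))
    return len(tasks)


# Example: 8 independent tasks on 4 workers take ~2 "rounds" instead of 8.
run_parallel([Task() for _ in range(8)], workers=4)
```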