Why best practices prevent technical debt in Apache Airflow - Performance Analysis
This analysis looks at how following best practices in Airflow affects the time it takes to build and manage workflows as they grow.
How does good structure help keep things running smoothly over time?
Analyze the time complexity of the following Airflow DAG setup.
```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def task_function():
    pass

dag = DAG('example_dag', start_date=datetime(2024, 1, 1))

# Create 100 tasks (n creation operations).
tasks = []
for i in range(100):
    task = PythonOperator(
        task_id=f'task_{i}',
        python_callable=task_function,
        dag=dag,
    )
    tasks.append(task)

# Link them into a sequential chain (n - 1 linking operations).
for i in range(99):
    tasks[i] >> tasks[i + 1]
```
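To count the operations without an Airflow install, here is a minimal sketch using a hypothetical `Task` class that overloads `>>` the same way Airflow's `BaseOperator` does. The class and its attributes are stand-ins for counting purposes only, not Airflow's real implementation.

```python
class Task:
    """Toy stand-in for an Airflow operator (hypothetical, for counting only)."""
    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # Mirrors Airflow's `task_a >> task_b`: record one dependency edge.
        self.downstream.append(other)
        return other

n = 100
tasks = [Task(f"task_{i}") for i in range(n)]   # n creation operations
for i in range(n - 1):                          # n - 1 linking operations
    tasks[i] >> tasks[i + 1]

edges = sum(len(t.downstream) for t in tasks)
print(n, edges, n + edges)  # 100 tasks, 99 edges, 199 total operations
```

Running this for n = 100 reproduces the 199-operation figure used in the table below: every task adds one creation and (except the first) one link.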
This code creates 100 tasks linked in a sequential chain inside a single Airflow DAG.
To analyze it, look at the operations that repeat: the loop that creates tasks and the loop that wires their dependencies.
- Primary operation: Creating and linking 100 tasks in sequence.
- How many times: The first loop runs 100 times to create tasks; the second loop runs 99 times to link tasks.
As the number of tasks grows, the number of operations needed to create and link them grows in direct proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 19 (10 creations + 9 links) |
| 100 | About 199 (100 creations + 99 links) |
| 1000 | About 1999 (1000 creations + 999 links) |
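Every row in the table follows the same formula: n creations plus n − 1 links, or 2n − 1 operations in total. A quick sketch (the helper name `setup_operations` is invented for illustration):

```python
def setup_operations(n):
    creations = n      # one PythonOperator instantiated per task
    links = n - 1      # one `>>` dependency per adjacent pair in the chain
    return creations + links

for n in (10, 100, 1000):
    print(n, setup_operations(n))  # 19, 199, 1999 -- matches the table
```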
Pattern observation: The work grows steadily with the task count; doubling the number of tasks roughly doubles the number of operations (2n − 1 in total).
Time Complexity: O(n)
This means DAG setup time grows linearly with the number of tasks: each added task contributes a roughly constant amount of creation and linking work.
[X] Wrong: "Adding more tasks won't affect setup time much because tasks run independently."
[OK] Correct: Even if tasks run separately, creating and linking them takes more time as you add more tasks, so setup time grows with task count.
Understanding how task setup time grows helps you design workflows that stay manageable and avoid hidden slowdowns as projects grow.
"What if we changed the linear chain to a parallel task setup? How would the time complexity change?"