High availability configuration in Apache Airflow - Time & Space Complexity
When setting up high availability in Airflow, we want to understand how the system scales as tasks and workers are added.
The question: how does adding more workers or tasks affect the scheduler's total work and response time?
Analyze the time complexity of the following Airflow scheduler and worker setup.
```python
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

def create_task(task_id, dag):
    # Each call builds one BashOperator and registers it with the DAG.
    return BashOperator(
        task_id=task_id,
        bash_command='echo Hello',
        dag=dag,
    )

dag = DAG('ha_dag', start_date=datetime(2024, 1, 1))
# Create 100 tasks, one scheduling unit per task.
tasks = [create_task(f'task_{i}', dag) for i in range(100)]
```
This code creates 100 tasks in a DAG, simulating a workload for a high availability Airflow setup.
Look for loops or repeated actions that affect performance.
- Primary operation: Creating and scheduling each task in the DAG.
- How many times: 100 times, once per task.
As the number of tasks grows, the scheduler must handle more work.
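The scheduling pass above can be sketched as a simple counter. This is a minimal sketch in plain Python (no Airflow required), assuming each task costs one unit of scheduler work; the one-unit cost model is an illustrative assumption, not Airflow's actual internals:

```python
# Count scheduler operations as a function of the number of tasks.
# Assumption: each task contributes one unit of scheduling work.
def scheduling_operations(num_tasks: int) -> int:
    ops = 0
    for _ in range(num_tasks):  # one scheduling step per task
        ops += 1
    return ops

print(scheduling_operations(100))   # 100
print(scheduling_operations(1000))  # 1000
```

Doubling `num_tasks` doubles the count, which is exactly the linear pattern the table below shows.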
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 task creations and scheduling steps |
| 100 | 100 task creations and scheduling steps |
| 1000 | 1000 task creations and scheduling steps |
Pattern observation: The work grows directly with the number of tasks; doubling tasks doubles the work.
Time Complexity: O(n)
This means the time to schedule tasks grows linearly with the number of tasks.
[X] Wrong: "Adding more workers makes scheduling time stay the same no matter how many tasks there are."
[OK] Correct: Even with more workers, the scheduler still processes each task, so the total scheduling work grows with tasks.
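To see why more workers do not eliminate the growth, here is a hedged sketch separating total work from wall time. The `ceil(n / w)` wall-time model is an idealization that assumes perfect load balancing and ignores queueing and coordination overhead:

```python
import math

def total_scheduling_ops(num_tasks: int) -> int:
    # The scheduler still touches every task once: O(n), regardless of workers.
    return num_tasks

def ideal_wall_time_units(num_tasks: int, num_workers: int) -> int:
    # Idealized execution wall time with perfect load balancing: ceil(n / w).
    return math.ceil(num_tasks / num_workers)

print(total_scheduling_ops(100))       # 100 -- unchanged by worker count
print(ideal_wall_time_units(100, 4))   # 25
print(ideal_wall_time_units(100, 8))   # 13
```

Workers shrink the wall-clock term, but the total scheduling work term stays linear in the number of tasks.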
Understanding how task scheduling scales helps you design reliable Airflow setups that keep running smoothly as work grows.
"What if we split the tasks into multiple DAGs running in parallel? How would the time complexity of scheduling change?"