Celery executor for distributed execution in Apache Airflow - Time & Space Complexity
When Airflow uses the Celery executor, tasks are dispatched to a pool of workers and can run concurrently. We want to understand how the total work and the elapsed time change as we add more tasks.
How does the system handle more tasks, and how does that affect execution time?
Analyze the time complexity of the following Airflow DAG snippet under the Celery executor.
```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def task_function():
    print("Task running")

dag = DAG('example_celery', start_date=datetime(2024, 1, 1))

# Create 100 independent tasks; none depends on another,
# so Celery workers can pick them up in parallel.
task_list = []
for i in range(100):
    task = PythonOperator(
        task_id=f'task_{i}',
        python_callable=task_function,
        dag=dag,
    )
    task_list.append(task)
```
This code creates 100 independent tasks that Celery workers can run in parallel.
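For these tasks to run concurrently at all, Airflow must be pointed at the Celery executor and a message broker. A minimal sketch of the relevant `airflow.cfg` entries; the broker and result-backend URLs below are placeholder assumptions, not values from this example:

```ini
[core]
# Switch from the default executor to Celery.
executor = CeleryExecutor

[celery]
# Placeholder URLs; point these at your own Redis/Postgres instances.
broker_url = redis://localhost:6379/0
result_backend = db+postgresql://airflow:airflow@localhost/airflow
# How many task slots each Celery worker process offers.
worker_concurrency = 16
```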
Look at what repeats when running this DAG with the Celery executor.
- Primary operation: Executing each of the 100 tasks independently.
- How many times: Once per task, so 100 times total.
As the number of tasks grows, the total work grows too, but workers can share the load.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 tasks executed |
| 100 | 100 tasks executed |
| 1000 | 1000 tasks executed |
Pattern observation: Total tasks increase linearly with input size, but parallel workers can reduce total elapsed time.
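The pattern above can be sketched with a simple model. This is an idealized back-of-the-envelope calculation (assuming equal task durations and a hypothetical fixed worker count), not a measurement of real Celery behavior:

```python
import math

def total_operations(n: int) -> int:
    # Total work: each task executes exactly once, so work is O(n).
    return n

def elapsed_rounds(n: int, workers: int) -> int:
    # Idealized model: with `workers` parallel slots and equal task
    # durations, tasks complete in ceil(n / workers) waves.
    return math.ceil(n / workers)

for n in (10, 100, 1000):
    print(n, total_operations(n), elapsed_rounds(n, workers=16))
```

Total operations grow linearly with n, while the number of waves (a proxy for elapsed time) is divided by the worker count.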
Time Complexity: O(n)
This means the total amount of work, measured in task executions, grows directly with the input size. Elapsed wall-clock time is a separate question: it depends on how many workers share that work.
[X] Wrong: "Adding more tasks will always make the total execution time grow linearly."
[OK] Correct: Because Celery runs tasks in parallel on many workers, total elapsed time can stay similar if enough workers are available.
Understanding how distributed task execution scales is a key skill. It shows you can think about how systems handle more work and use parallelism well.
"What if the number of workers is fixed and tasks keep increasing? How would the time complexity of total elapsed time change?"