Catchup and backfill behavior in Apache Airflow - Time & Space Complexity
When an Airflow DAG has a start_date in the past, the catchup and backfill features create runs for the missed schedule intervals. We want to know how the number of runs grows as that date range grows.
Analyze the time complexity of this Airflow DAG scheduling snippet.
```python
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator  # EmptyOperator in Airflow 2.4+
from datetime import datetime

default_args = {'start_date': datetime(2023, 1, 1)}

# catchup=True tells the scheduler to create a run for every missed daily interval.
dag = DAG('example_dag', default_args=default_args, schedule_interval='@daily', catchup=True)

task = DummyOperator(task_id='dummy_task', dag=dag)
```
This DAG runs daily starting from Jan 1, 2023, with catchup enabled to run missed dates.
Identify the loops, recursion, or repeated traversals that drive the work.
- Primary operation: Airflow schedules one task run per missed day.
- How many times: Once for each day from start_date until today.
As the number of missed days grows, the number of task runs grows linearly.
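That count can be sketched directly. The helper below is hypothetical (not part of Airflow's API); it assumes the @daily schedule creates one run per fully elapsed day between start_date and now:

```python
from datetime import datetime

def missed_daily_runs(start_date: datetime, now: datetime) -> int:
    # One catchup run per fully elapsed day between start_date and now.
    return max(0, (now - start_date).days)

# Starting the DAG ten days late yields ten catchup runs.
print(missed_daily_runs(datetime(2023, 1, 1), datetime(2023, 1, 11)))  # → 10
```

Doubling the delay doubles the result, which is exactly the linear growth described above.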
| Input Size (days missed) | Approx. Operations (task runs) |
|---|---|
| 10 | 10 |
| 100 | 100 |
| 1000 | 1000 |
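The table rows can be reproduced with a small simulation of the catchup loop. This is a sketch of the counting behavior, not Airflow's actual scheduler code:

```python
from datetime import datetime, timedelta

def simulate_catchup(days_missed: int) -> int:
    """Count scheduling operations: one iteration per missed daily interval."""
    start = datetime(2023, 1, 1)
    end = start + timedelta(days=days_missed)
    runs = 0
    current = start
    while current < end:               # loops once per missed day -> O(n)
        runs += 1                      # stands in for "create one DAG run"
        current += timedelta(days=1)
    return runs

for n in (10, 100, 1000):
    print(n, simulate_catchup(n))      # operations equal days missed
```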
Pattern observation: The number of task runs grows directly with the number of missed days.
Time Complexity: O(n)
This means the work Airflow does grows in a straight line as the number of missed days increases.
[X] Wrong: "Catchup runs all missed tasks instantly with no extra cost as days increase."
[OK] Correct: Each missed day adds a separate task run, so more days mean more work and time.
Understanding how Airflow schedules past runs helps you explain system behavior clearly and reason about how task volume scales.
"What if catchup was set to False? How would the time complexity of task runs change when starting the DAG late?"