Execution date vs logical date in Apache Airflow - Performance Comparison
In Airflow, every DAG run is tied to a logical date (called execution_date before Airflow 2.2), not to the wall-clock time at which it actually runs. We want to understand how the number of task instances grows as we schedule more logical dates.
How does the difference between execution date (logical date) and actual run time affect the number of task instances executed?
Analyze the time complexity of scheduling this Airflow DAG over num_runs logical dates (e.g., via backfill or ongoing scheduling).
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from datetime import datetime

dag = DAG(
    'example_dag',
    start_date=datetime(2023, 1, 1),
    schedule_interval='@daily',
    catchup=True,  # Enables backfill for past logical dates
)

task = DummyOperator(task_id='daily_task', dag=dag)
This code defines a DAG with a fixed number of tasks. Airflow creates one task instance per logical date per task.
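To make "one run per logical date" concrete, here is a minimal sketch in plain Python (no Airflow import) that lists the logical dates a daily DAG with catchup enabled would cover. The helper `logical_dates` and the `now` value are hypothetical, chosen only for illustration, and the sketch assumes a run is created once its one-day data interval has fully elapsed.

```python
from datetime import datetime, timedelta


def logical_dates(start, now, interval=timedelta(days=1)):
    """Yield each logical date whose data interval has fully elapsed by `now`.

    Hypothetical helper: it approximates catchup for a fixed daily interval
    and is not part of the Airflow API.
    """
    current = start
    while current + interval <= now:
        yield current
        current += interval


# Assumed values: the DAG's start_date and an arbitrary "current" time.
dates = list(logical_dates(datetime(2023, 1, 1), datetime(2023, 1, 11)))
print(len(dates))           # number of DAG runs to catch up on
print(dates[0], dates[-1])  # first and last logical date
```

With a single task in the DAG, the number of task instances equals the number of dates produced here.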
Look for repeated actions in Airflow's execution model.
- Primary operation: Scheduler instantiates tasks for each logical date (execution_date).
- How many times: Once per logical date per task, so the total is controlled by num_runs (the number of DAG runs, i.e. scheduled logical dates).
As the number of logical dates (DAG runs) increases, the number of task instances grows linearly (assuming fixed tasks per DAG).
| Input Size (num_runs = logical dates) | Approx. Operations (task instances) |
|---|---|
| 10 | 10 |
| 100 | 100 |
| 1000 | 1000 |
Pattern observation: Linear growth, O(n), where n is the number of logical dates. With multiple tasks per DAG, the count is (number of tasks) × n.
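The table can be reproduced with a short sketch that pairs every logical date with every task ID. The function `task_instance_keys` is a hypothetical name used for illustration; it models the counting only and does not call Airflow.

```python
from datetime import datetime, timedelta
from itertools import product


def task_instance_keys(start, num_runs, task_ids):
    """Return one (logical_date, task_id) pair per task instance.

    Hypothetical model: assumes a daily schedule and a fixed set of tasks.
    """
    dates = [start + timedelta(days=i) for i in range(num_runs)]
    return list(product(dates, task_ids))


for n in (10, 100, 1000):
    keys = task_instance_keys(datetime(2023, 1, 1), n, ["daily_task"])
    print(n, len(keys))
```

Because the result is a Cartesian product, its size is always the number of logical dates times the number of tasks.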
Time Complexity: O(n) task instances, where n = number of logical dates scheduled.
The work (executions) grows proportionally to scheduled dates.
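[X] Wrong: "The execution date (logical date) is the time the task actually runs, so scheduling more logical dates won't increase the number of task instances."
[OK] Correct: The logical date (formerly execution_date) identifies the data interval a run covers, not the moment it executes. With a @daily schedule in Airflow 2.x, the run starts only after that interval has ended. Each logical date still gets its own DAG run, so more logical dates means more task instances, regardless of when they actually run.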
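The gap between logical date and run time can be sketched as follows. This assumes a fixed one-day interval (as with `@daily` in Airflow 2.x); `earliest_run_time` is a hypothetical helper, not an Airflow function.

```python
from datetime import datetime, timedelta

# Assumption: a fixed daily interval, as with schedule_interval='@daily'.
INTERVAL = timedelta(days=1)


def earliest_run_time(logical_date):
    """Return the end of the data interval, the earliest moment the run can start."""
    return logical_date + INTERVAL


logical_date = datetime(2023, 1, 1)
print("logical date:     ", logical_date)
print("earliest run time:", earliest_run_time(logical_date))

# A backfill launched months later still covers the same logical dates,
# so the number of runs depends on the date range, not on the launch time.
backfill_dates = [logical_date + i * INTERVAL for i in range(7)]
print(len(backfill_dates))
```

The count of runs comes entirely from the range of logical dates; the wall-clock time at which they execute never enters the calculation.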
Understanding the difference between the logical date (execution_date) and the actual run time helps you optimize backfills, predict scheduler load, and design scalable DAGs. It also shows a solid grasp of how Airflow scheduling works.
What if the DAG had 5 tasks? How would time complexity change for num_runs dates?