
What is Catchup in Airflow: Explanation and Usage

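In Apache Airflow, catchup is a DAG-level setting that controls whether the scheduler creates DAG runs for schedule intervals that have already passed without a run, such as intervals between the DAG's start_date and today, or intervals missed while the DAG was paused or the scheduler was down. If catchup=True, Airflow creates a run for every missed interval; if catchup=False, it creates a run only for the most recent interval and skips the older ones.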

How It Works

Imagine you have a daily task that should run every day at 8 AM. If your Airflow scheduler is down for a few days, those daily runs are missed. The catchup feature decides if Airflow should go back and run all those missed days once the scheduler is back up.

When catchup=True, Airflow acts like a diligent friend who wants to complete all the missed work, creating a DAG run for every missed interval until it catches up to today. When catchup=False, Airflow behaves like a friend who only cares about the current day: it skips the missed intervals and creates a DAG run only for the most recent one.

This helps control workload and resource use, especially if running all missed tasks would be too heavy or unnecessary.
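
The difference between the two settings can be sketched in plain Python, without Airflow installed. This is a simplified model, not Airflow's scheduler code: the function name `intervals_to_run` is hypothetical, and it assumes fixed-length intervals that begin at the start date and that no runs exist yet.

```python
from datetime import datetime, timedelta


def intervals_to_run(start_date, now, interval, catchup):
    """Return the start of each completed interval that would get a run.

    Simplified model (an assumption, not Airflow internals): an interval
    becomes eligible for a run only once it has fully ended.
    """
    completed = []
    current = start_date
    while current + interval <= now:
        completed.append(current)
        current += interval
    if catchup:
        # catchup=True: one run per completed interval
        return completed
    # catchup=False: only the most recent completed interval, if any
    return completed[-1:]


start = datetime(2024, 4, 1)
now = datetime(2024, 4, 5, 9, 0)  # scheduler comes back on April 5 at 09:00
daily = timedelta(days=1)

backfilled = intervals_to_run(start, now, daily, catchup=True)
latest_only = intervals_to_run(start, now, daily, catchup=False)

print(len(backfilled), len(latest_only))  # prints: 4 1
```

In this scenario, four daily intervals (April 1 through April 4) have ended by the time the scheduler is back. With catchup enabled all four get a run; with it disabled only the April 4 interval does.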

Example

This example shows how to set catchup to False in a DAG, so that Airflow skips past intervals and creates only the most recent scheduled DAG run.

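```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Defaults applied to every task in the DAG
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2024, 4, 1),  # first date the schedule covers
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'example_catchup',
    default_args=default_args,
    # Deprecated since Airflow 2.4 in favor of the `schedule` argument
    schedule_interval='@daily',
    catchup=False,  # do not create runs for intervals that already passed
)

# Single task that prints the current date on the worker
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag,
)
```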
Output
When the scheduler picks up this DAG after some days have been missed, it creates and executes only the most recent scheduled run and skips all earlier missed runs.

When to Use

Use catchup=True when you want to ensure every scheduled run happens, such as for critical data processing or reports that must not miss any day.

Use catchup=False when missed runs are not important or would cause unnecessary load, like in cases where only the latest data matters or when running all missed tasks would overwhelm your system.

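For example, if you have a daily report that can be skipped if missed, set catchup=False. But if you have a billing process that must run for every day, set catchup=True. Setting the value explicitly in the DAG is safer than relying on the default, which comes from the catchup_by_default configuration option and may differ between installations.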

Key Points

  • catchup=True runs all missed DAG runs to catch up.
  • catchup=False runs only the latest scheduled DAG run.
  • Setting catchup controls workload and resource use after downtime.
  • Choose catchup based on whether missed runs are important for your workflow.

Key Takeaways

  • The catchup setting controls whether Airflow creates DAG runs for schedule intervals that were missed, for example during downtime.
  • Set catchup=False to skip past intervals and run only the most recent scheduled DAG run.
  • Set catchup=True to ensure every scheduled run happens, even if delayed.
  • The right choice depends on how critical missed runs are for your workflow.