
Catchup and backfill behavior in Apache Airflow - Commands & Configuration

Introduction
Airflow runs each DAG on a schedule, but runs can be missed, for example while the scheduler is down. Catchup and backfill let you run those missed intervals so your data stays complete and up to date.
Typical situations:
The scheduler was down and scheduled runs were missed.
You add a new DAG with a start date in the past and want every run since then.
You want to rerun past dates to fix data or update reports.
You want to control whether Airflow runs missed intervals automatically.
You want to trigger runs manually for specific past dates.
Config File - example_dag.py
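from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id='example_catchup_backfill',
    start_date=datetime(2024, 6, 1),  # first data interval starts here
    schedule_interval='@daily',       # one run per day
    catchup=True,                     # create runs for every missed interval
    max_active_runs=1,                # work through the backlog one run at a time
) as dag:

    # Single task that prints the worker's current date and time.
    task1 = BashOperator(
        task_id='print_date',
        bash_command='date',
    )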

dag_id: The unique name of the workflow.

start_date: The start of the first data interval; the first run covers June 1, 2024.

schedule_interval: How often the DAG runs (daily here).

catchup: If True, the scheduler creates a run for every interval between start_date and now that has not run yet.

max_active_runs: Limits how many DAG runs can run at the same time.

task1: A simple task that prints the current date.
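
To make the effect of catchup concrete, here is a minimal sketch in plain Python of which logical dates a daily schedule would produce. It is a simplified model, not Airflow's scheduler code. The function name daily_runs_to_create and the fixed "now" value are assumptions for illustration, and the model assumes a run for a given day is created only once that day's interval has ended.

```python
from datetime import datetime, timedelta


def daily_runs_to_create(start_date, now, catchup):
    """Return the logical dates a daily schedule would create runs for.

    Simplified model (an assumption, not Airflow's code): a run for day D
    is created once the interval [D, D + 1 day) has fully elapsed.
    """
    one_day = timedelta(days=1)
    runs = []
    logical_date = start_date
    while logical_date + one_day <= now:
        runs.append(logical_date)
        logical_date += one_day
    if not catchup:
        # Without catchup, keep only the most recent completed interval.
        runs = runs[-1:]
    return runs


start = datetime(2024, 6, 1)
now = datetime(2024, 6, 4, 9, 30)  # hypothetical "current time"

with_catchup = daily_runs_to_create(start, now, catchup=True)
without_catchup = daily_runs_to_create(start, now, catchup=False)

print([d.date().isoformat() for d in with_catchup])
print([d.date().isoformat() for d in without_catchup])
```

In this model, catchup=True yields runs for June 1, 2 and 3, while catchup=False yields only June 3. The real DAG above also sets max_active_runs=1, so a backlog like this would be worked through one run at a time.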

Commands
Lists all DAGs available in Airflow to confirm your DAG is loaded.
Terminal
airflow dags list
Expected Output
example_catchup_backfill
example_other_dag
Manually triggers a DAG run with a logical date of June 5, 2024, to rerun that date's tasks.
Terminal
airflow dags trigger example_catchup_backfill -e 2024-06-05 --run-id manual__20240605
Expected Output
Created <DagRun example_catchup_backfill @ 2024-06-05 00:00:00: manual__20240605, externally triggered: True>
-e - Sets the logical (execution) date of the run in Airflow 2.x. Without it, the run is dated at the time you trigger it.
--run-id - Sets a custom run identifier for tracking this manual run.
Runs all missed DAG runs from June 1 to June 3, 2024, to backfill data for those dates.
Terminal
airflow dags backfill example_catchup_backfill -s 2024-06-01 -e 2024-06-03
Expected Output
Running backfill for example_catchup_backfill from 2024-06-01 to 2024-06-03
[2024-06-01 00:00:00] Task print_date succeeded
[2024-06-02 00:00:00] Task print_date succeeded
[2024-06-03 00:00:00] Task print_date succeeded
Backfill done.
-s - Start date of backfill range.
-e - End date of backfill range.
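
If you launch backfills from a script, a small helper can validate the range before calling the CLI. The sketch below is an illustration only: backfill_plan is a hypothetical helper, it assumes a daily schedule, and it treats both ends of the range as inclusive to match the three runs in the example output above. It only builds the argument list and does not call Airflow.

```python
from datetime import date, timedelta


def backfill_plan(dag_id, start, end):
    """Return (argv, dates) for a daily backfill over [start, end]."""
    if end < start:
        raise ValueError("end date must not be before start date")
    argv = [
        "airflow", "dags", "backfill", dag_id,
        "-s", start.isoformat(),
        "-e", end.isoformat(),
    ]
    # Both ends inclusive (assumption), one logical date per day.
    days = (end - start).days + 1
    dates = [start + timedelta(days=i) for i in range(days)]
    return argv, dates


argv, dates = backfill_plan(
    "example_catchup_backfill", date(2024, 6, 1), date(2024, 6, 3)
)
print(" ".join(argv))
print([d.isoformat() for d in dates])
```

On a machine with the Airflow CLI installed, the resulting argv list could be passed to subprocess.run(argv, check=True).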
Pauses the DAG to stop automatic scheduling and catchup runs.
Terminal
airflow dags pause example_catchup_backfill
Expected Output
Dag example_catchup_backfill is paused
Unpauses the DAG to resume automatic scheduling and catchup runs.
Terminal
airflow dags unpause example_catchup_backfill
Expected Output
Dag example_catchup_backfill is unpaused
Key Concept

If you remember nothing else, remember: catchup=True makes Airflow run all missed scheduled tasks since the start date automatically, while backfill lets you manually run tasks for past dates.

Common Mistakes
Setting catchup=False but expecting Airflow to run missed tasks automatically.
With catchup=False, the scheduler skips past missed intervals and creates a run only for the most recent one.
Set catchup=True in the DAG if you want Airflow to run all missed tasks automatically.
Running backfill without specifying start and end dates.
Backfill needs a date range to know which past tasks to run; without it, the command fails or runs unexpected dates.
Always use -s and -e flags with airflow dags backfill to specify the exact date range.
Pausing a DAG and expecting scheduled tasks to run.
While a DAG is paused, the scheduler creates no scheduled or catchup runs for it.
Unpause the DAG to allow scheduled and catchup runs.
Summary
Set catchup=True in your DAG to let Airflow automatically run all missed scheduled tasks since the start date.
Use airflow dags backfill with start and end dates to manually run tasks for specific past dates.
Pause and unpause DAGs to control whether scheduled and catchup runs happen automatically.