Apache Airflow · devops · ~5 min read

Why Scheduling Automates Pipeline Execution in Apache Airflow

Introduction
Scheduling lets you run your data tasks automatically at set times. This saves you from running them by hand and makes sure your data is always fresh.
When you want to run a data pipeline every day at midnight without manual work
When you need to process logs every hour to keep reports updated
When you want to trigger a backup job every Sunday morning automatically
When you want to run a machine learning training job weekly without forgetting
When you want to chain tasks that run one after another on a schedule
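Each of the scenarios above maps to a cron expression (or an Airflow preset). A quick reference, using standard five-field cron syntax; the names here are just labels for the use cases:

```python
# Cron expressions for the scheduling scenarios above.
# Field order: minute hour day-of-month month day-of-week
schedules = {
    "daily_at_midnight": "0 0 * * *",      # every day at 00:00
    "hourly_log_processing": "0 * * * *",  # at minute 0 of every hour
    "sunday_morning_backup": "0 6 * * 0",  # Sundays at 06:00 (0 = Sunday)
    "weekly_ml_training": "@weekly",       # Airflow preset: midnight on Sunday
}

for name, expr in schedules.items():
    print(f"{name}: {expr}")
```

Any of these strings can be passed as the DAG's `schedule_interval`.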
DAG File - my_dag.py
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2024, 6, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'example_scheduled_dag',
    default_args=default_args,
    description='A simple scheduled DAG',
    schedule_interval='0 0 * * *',  # every day at midnight
    catchup=False,
)

task1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag,
)

This file defines a DAG (Directed Acyclic Graph) named example_scheduled_dag. It runs every day at midnight via schedule_interval='0 0 * * *'. The default_args set basic rules such as the start date and retry policy, and the BashOperator runs a simple shell command to print the date. This setup runs the task daily without manual triggers. (In Airflow 2.4 and later, the schedule parameter is preferred over the deprecated schedule_interval, though both still work in 2.x.)
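The last use case from the introduction, chaining tasks on a schedule, takes one extra line: define a second operator and declare the dependency with >>. A minimal sketch along the lines of my_dag.py; the echo_done task and the DAG name here are illustrative:

```python
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

dag = DAG(
    'example_scheduled_dag_chained',     # illustrative DAG name
    start_date=datetime(2024, 6, 1),
    schedule_interval='0 0 * * *',       # every day at midnight
    catchup=False,
)

task1 = BashOperator(task_id='print_date', bash_command='date', dag=dag)
task2 = BashOperator(task_id='echo_done', bash_command='echo done', dag=dag)

# In each scheduled run, task2 starts only after task1 succeeds.
task1 >> task2
```

The >> operator only declares ordering; the scheduler still decides when each run starts, based on the schedule.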

Commands
This command lists all DAGs Airflow knows about, so you can check if your scheduled DAG is recognized.
Terminal
airflow dags list
Expected Output
example_scheduled_dag
This command manually triggers the DAG to run immediately, useful for testing your schedule setup.
Terminal
airflow dags trigger example_scheduled_dag
Expected Output
Created <DagRun example_scheduled_dag @ 2024-06-01T12:00:00+00:00: manual__2024-06-01T12:00:00+00:00, externally triggered: True>
This shows all tasks inside the DAG so you know what will run on schedule.
Terminal
airflow tasks list example_scheduled_dag
Expected Output
print_date
This runs the task print_date for the date 2024-06-01 without affecting the scheduler, useful for debugging.
Terminal
airflow tasks test example_scheduled_dag print_date 2024-06-01
Expected Output
[2024-06-01 12:00:00,000] {bash.py:123} INFO - Running command: date
[2024-06-01 12:00:00,100] {bash.py:130} INFO - Output: Sat Jun 1 12:00:00 UTC 2024
[2024-06-01 12:00:00,200] {taskinstance.py:123} INFO - Task succeeded
Key Concept

Scheduling in Airflow runs your pipelines automatically at set times so you don't have to start them manually.

Common Mistakes
Setting schedule_interval to None or empty string
This disables scheduling, so your DAG will never run automatically.
Use a valid cron expression or a preset such as '@daily' to enable automatic runs.
Not setting start_date or setting it in the future
Airflow won't run the DAG until the start_date is reached, so no runs happen if start_date is wrong.
Set start_date to a past or current date to allow scheduling to begin.
Forgetting to set catchup=False when you don't want backfill runs
Airflow will try to run all missed schedules since start_date, causing many unexpected runs.
Set catchup=False to run only the latest scheduled run.
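To see why a missing catchup=False matters: with a daily schedule and a start_date months in the past, Airflow creates one DagRun per missed interval. A rough illustration of the run count using plain datetime arithmetic (not Airflow's actual scheduler logic; the dates are made up):

```python
from datetime import datetime, timedelta

start_date = datetime(2024, 6, 1)
now = datetime(2024, 9, 1)  # pretend "now" for the illustration

# With catchup=True (the default) and a daily schedule, Airflow would
# backfill one DagRun for every missed day between start_date and now.
missed_runs = (now - start_date) // timedelta(days=1)
print(missed_runs)  # 92 daily runs queued at once
```

With catchup=False, Airflow skips the backlog and schedules only the most recent interval.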
Summary
Define a DAG with a schedule_interval to automate when it runs.
Use airflow CLI commands to list, trigger, and test your DAG and tasks.
Scheduling saves manual work by running pipelines automatically at set times.