
Creating a basic DAG file in Apache Airflow - Step-by-Step CLI Walkthrough

Introduction
A DAG file in Airflow defines a workflow as a set of tasks with dependencies. Creating a basic DAG file helps you automate and schedule tasks in a clear, repeatable way. Typical use cases:

- Running a data processing job every day at a specific time.
- Automating weekly report delivery without manual effort.
- Chaining tasks so one runs only after another finishes successfully.
- Monitoring and automatically retrying failed tasks.
- Keeping track of task execution history and logs.
Config File - basic_dag.py
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

# Defaults applied to every task in this DAG unless overridden per task.
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,       # each run is independent of the previous one
    'start_date': datetime(2024, 1, 1),
    'retries': 1,                   # retry a failed task once
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'basic_dag',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval=timedelta(days=1),  # run once per day
    catchup=False,                        # do not backfill missed runs
)

# Task 1: print the current date.
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag,
)

# Task 2: sleep for 5 seconds.
t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    dag=dag,
)

# t2 runs only after t1 succeeds.
t1 >> t2

- default_args: sets default parameters for tasks, such as owner, retries, and start date.
- DAG: defines the workflow with an ID, schedule, and description.
- BashOperator: runs shell commands as tasks.
- t1 >> t2: sets task order so t2 runs only after t1 succeeds.
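The `>>` operator is one of several equivalent ways to declare the same dependency. A minimal sketch, reusing the `t1` and `t2` tasks defined in basic_dag.py above:

```python
# All of these declare the same dependency: t2 runs after t1 succeeds.
t1 >> t2               # bitshift syntax (used above, most common)
t2 << t1               # reversed bitshift, identical meaning
t1.set_downstream(t2)  # explicit method call
t2.set_upstream(t1)    # mirror of set_downstream
```

Pick one style and use it consistently; declaring the same edge more than once is harmless but redundant.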

Commands
Lists all DAGs currently available in Airflow to verify the new DAG is recognized.
Terminal
airflow dags list
Expected Output
basic_dag
example_bash_operator
example_python_operator
Manually triggers the basic_dag to run immediately for testing.
Terminal
airflow dags trigger basic_dag
Expected Output
Created <DagRun basic_dag @ 2024-06-01T12:00:00+00:00: manual__2024-06-01T12:00:00+00:00, externally triggered: True>
Lists all tasks defined in the basic_dag to confirm the tasks are loaded.
Terminal
airflow tasks list basic_dag
Expected Output
print_date
sleep
Runs the print_date task for the given date without scheduling, useful for debugging.
Terminal
airflow tasks test basic_dag print_date 2024-06-01
Expected Output
[2024-06-01 12:00:00,000] {bash.py:123} INFO - Running command: date
[2024-06-01 12:00:00,100] {bash.py:130} INFO - Output: Thu Jun 1 12:00:00 UTC 2024
[2024-06-01 12:00:00,200] {taskinstance.py:1234} INFO - Task exited with return code 0
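Beyond the CLI, you can verify from Python that the DAG file parses cleanly. A minimal sketch using Airflow's `DagBag`; it assumes Airflow is installed and basic_dag.py sits in your configured DAGs folder:

```python
from airflow.models import DagBag

# Load every DAG file from the configured DAGs folder.
dag_bag = DagBag()

# import_errors maps file paths to parse errors; empty means all files loaded.
assert not dag_bag.import_errors, dag_bag.import_errors

# Confirm our DAG and its tasks were registered.
dag = dag_bag.get_dag('basic_dag')
print(sorted(t.task_id for t in dag.tasks))
```

This is the same check many teams run in CI so a broken DAG file never reaches the scheduler.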
Key Concept

If you remember nothing else from this pattern, remember: a DAG file defines tasks and their order to automate workflows in Airflow.

Common Mistakes
Not setting a start_date, or setting it in the future.
Airflow will not schedule DAG runs if start_date is missing or later than the current date. Always set start_date to a past or current date so scheduling can begin.

Forgetting to set catchup=False when you don't want backdated runs.
With catchup enabled, Airflow tries to run every missed schedule between start_date and now, causing many unexpected runs. Set catchup=False on the DAG to run only from now on.

Not defining task dependencies, so tasks run in parallel unintentionally.
Tasks may run out of order, causing errors when one depends on another's output. Use >> (or set_upstream/set_downstream) to define task order.
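A minimal sketch pulling these fixes together in one DAG (the DAG id `safe_dag` and the task names are hypothetical, chosen for illustration; assumes Airflow 2.x):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

dag = DAG(
    'safe_dag',                           # hypothetical DAG id for illustration
    start_date=datetime(2024, 1, 1),      # fixed past date, so scheduling can begin
    schedule_interval=timedelta(days=1),
    catchup=False,                        # skip backdated runs between start_date and now
)

extract = BashOperator(task_id='extract', bash_command='echo extract', dag=dag)
load = BashOperator(task_id='load', bash_command='echo load', dag=dag)

extract >> load  # explicit ordering: load waits for extract to succeed
```

Each of the three common mistakes above is addressed by one line here: the past start_date, catchup=False, and the explicit `>>` dependency.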
Summary
- Create a Python file defining a DAG with tasks and their dependencies.
- Use airflow CLI commands to list, trigger, and test the DAG and its tasks.
- Set start_date and catchup properly to control scheduling behavior.