
DAG concept (Directed Acyclic Graph) in Apache Airflow - Commands & Configuration

Introduction
A DAG (Directed Acyclic Graph) in Airflow is a way to organize tasks so they run in a defined order without ever looping back. It lets you automate workflows by declaring which tasks run first and which run after. Typical situations where a DAG helps:
When you want to run a series of data processing steps one after another automatically.
When you need to schedule tasks to run at specific times without manual intervention.
When you want to make sure tasks don’t run in a circle causing endless loops.
When you want to track the progress and success of each step in a workflow.
When you want to retry failed tasks without restarting the whole process.
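The "run in order without looping back" idea is exactly topological ordering. As a minimal pure-Python sketch (not Airflow's internals), here is Kahn's algorithm applied to the three-task example used later in this page, where each task maps to its upstream dependencies:

```python
from collections import deque

def topological_order(deps):
    """Return one valid run order for a DAG given {task: [upstream tasks]}."""
    indegree = {t: len(ups) for t, ups in deps.items()}
    downstream = {t: [] for t in deps}
    for t, ups in deps.items():
        for u in ups:
            downstream[u].append(t)
    ready = deque(t for t, d in indegree.items() if d == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for d in downstream[t]:
            indegree[d] -= 1
            if indegree[d] == 0:
                ready.append(d)
    if len(order) != len(deps):
        raise ValueError("cycle detected: not a DAG")
    return order

# Dependencies of the example DAG below: print_date >> sleep >> echo_hello
deps = {"print_date": [], "sleep": ["print_date"], "echo_hello": ["sleep"]}
print(topological_order(deps))  # ['print_date', 'sleep', 'echo_hello']
```

The scheduler's job is conceptually this: only start a task once everything upstream of it has finished.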
Config File - example_dag.py
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2024, 6, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'example_dag',
    default_args=default_args,
    description='A simple DAG example',
    schedule_interval='@daily',
    catchup=False,
)

task1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag,
)

task2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    dag=dag,
)

task3 = BashOperator(
    task_id='echo_hello',
    bash_command='echo Hello Airflow',
    dag=dag,
)

# Define task order
task1 >> task2 >> task3

This file defines a DAG named 'example_dag' that runs daily starting June 1, 2024.

It has three tasks: print_date, sleep, and echo_hello.

The tasks run in order: print_date first, then sleep, then echo_hello.

Default arguments set retry behavior and owner info.
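The `retries` and `retry_delay` entries in `default_args` mean each task gets one initial attempt plus `retries` extra attempts, with a pause in between. A minimal pure-Python sketch of that behavior (a simplification, not Airflow's actual executor code):

```python
import time

def run_with_retries(task, retries=1, retry_delay=0.0):
    """Sketch of Airflow-style retries: one initial try plus up to
    `retries` extra attempts, pausing `retry_delay` seconds between them."""
    attempts = retries + 1
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: the task instance is marked failed
            time.sleep(retry_delay)

# A hypothetical flaky task that fails once, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient failure")
    return "ok"

print(run_with_retries(flaky, retries=1))  # ok (succeeds on the second attempt)
```

This is why a transient failure in `sleep` would not require rerunning `print_date`: only the failed task is retried, not the whole DAG.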

Commands
This command lists all DAGs currently available in Airflow to confirm your DAG is recognized.
Terminal
airflow dags list
Expected Output
example_dag
This command manually starts a run of the example_dag to test if tasks execute in order.
Terminal
airflow dags trigger example_dag
Expected Output
Created <DagRun example_dag @ 2024-06-01T00:00:00+00:00: manual__2024-06-01T00:00:00+00:00, externally triggered: True>
This command lists all tasks in the example_dag to verify the tasks defined in the DAG file.
Terminal
airflow tasks list example_dag
Expected Output
print_date
sleep
echo_hello
This command runs the print_date task for the given date without affecting the DAG run state, useful for debugging.
Terminal
airflow tasks test example_dag print_date 2024-06-01
Expected Output
[2024-06-01 00:00:00,000] {bash.py:123} INFO - Running command: date
[2024-06-01 00:00:00,100] {bash.py:130} INFO - Output: Sat Jun 1 00:00:00 UTC 2024
[2024-06-01 00:00:00,200] {taskinstance.py:123} INFO - Task succeeded
Key Concept

If you remember nothing else from this pattern, remember: a DAG is a set of tasks arranged so they run in order without loops.

Common Mistakes
Creating tasks without setting dependencies between them
Tasks will run in any order or all at once, breaking the intended workflow sequence.
Use >> or set_upstream/set_downstream to define the order tasks should run.
Defining a cycle in task dependencies (e.g., task1 >> task2 >> task1)
Airflow will fail to run the DAG because cycles cause infinite loops and break the acyclic rule.
Ensure dependencies form a directed acyclic graph with no loops.
Not setting a start_date or setting it in the future
The DAG will not run because Airflow schedules tasks starting from the start_date.
Set start_date to a past or current date to enable scheduling.
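The cycle mistake above (`task1 >> task2 >> task1`) is rejected when Airflow parses the DAG file. A small pure-Python sketch of how such a check can work (a DFS back-edge test, not Airflow's exact implementation), using a `{task: [downstream tasks]}` mapping:

```python
def has_cycle(downstream):
    """Detect a cycle in {task: [downstream tasks]} via depth-first search:
    a GRAY node reached again before it finishes is a back edge, i.e. a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {t: WHITE for t in downstream}

    def visit(t):
        color[t] = GRAY
        for d in downstream.get(t, []):
            if color.get(d, WHITE) == GRAY:   # back edge found
                return True
            if color.get(d, WHITE) == WHITE and visit(d):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and visit(t) for t in downstream)

print(has_cycle({"task1": ["task2"], "task2": ["task3"], "task3": []}))  # False
print(has_cycle({"task1": ["task2"], "task2": ["task1"]}))               # True
```

If your dependencies ever make this return True, the graph is no longer acyclic and the DAG cannot be scheduled.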
Summary
Define a DAG file with tasks and set their order using dependencies.
Use airflow CLI commands to list DAGs, trigger runs, and test tasks.
Remember a DAG must be acyclic and tasks run in the order you set.