Apache Airflow · DevOps · ~15 mins

Creating a basic DAG file in Apache Airflow - Mechanics & Internals

Overview - Creating a basic DAG file
What is it?
A DAG file in Airflow is a Python script that defines a Directed Acyclic Graph, which is a set of tasks with dependencies. It tells Airflow what tasks to run, in what order, and when. This file is the blueprint for scheduling and executing workflows automatically. It uses simple Python code to describe tasks and their relationships.
Why it matters
Without DAG files, Airflow wouldn't know what workflows to run or how to organize tasks. This would make automating complex processes impossible, leading to manual work and errors. DAG files solve the problem of managing and scheduling tasks reliably and clearly, saving time and reducing mistakes in data pipelines or other automated jobs.
Where it fits
Before learning DAG files, you should understand basic Python and the concept of task automation. After mastering DAG files, you can explore advanced Airflow features like sensors, operators, and dynamic workflows to build complex pipelines.
Mental Model
Core Idea
A DAG file is a Python script that maps out tasks and their order so Airflow can run them automatically and reliably.
Think of it like...
Think of a DAG file like a recipe card that lists all the cooking steps and their order, so anyone can follow it to make the dish perfectly every time.
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Task 1    │────▶│   Task 2    │────▶│   Task 3    │
└─────────────┘     └─────────────┘     └─────────────┘

This shows tasks connected in order, like steps in a recipe.
Build-Up - 7 Steps
1
Foundation: What is a DAG file in Airflow
🤔
Concept: Introduce the basic idea of a DAG file as a Python script defining tasks and their order.
A DAG file is a Python file that tells Airflow what tasks to run and when. It uses simple Python code to create tasks and set their order. Each task is like a step in a process, and the DAG connects these steps so Airflow knows the flow.
Result
You understand that a DAG file is the core way to tell Airflow about workflows.
Knowing that DAG files are just Python scripts helps you realize you can use Python's power to control workflows.
2
Foundation: Basic structure of a DAG file
🤔
Concept: Learn the minimal parts needed to create a DAG file: imports, DAG definition, and tasks.
A basic DAG file has three parts:
1. Import Airflow modules.
2. Define the DAG with a unique name and schedule.
3. Create tasks using operators.

Example:

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

default_args = {'start_date': datetime(2024, 1, 1)}

dag = DAG('my_first_dag', default_args=default_args, schedule_interval='@daily')

t1 = BashOperator(task_id='print_date', bash_command='date', dag=dag)
Result
You can write a minimal DAG file that Airflow can run.
Understanding the minimal structure lets you start building workflows quickly without confusion.
3
Intermediate: Defining task dependencies
🤔 Before reading on: do you think tasks run in the order they are defined, or only when dependencies are set? Commit to your answer.
Concept: Learn how to set the order of tasks by defining dependencies between them.
Tasks in a DAG don't run just because they are defined. You must tell Airflow which tasks depend on others. Use the >> or << operators to set this.

Example:

t1 >> t2  # t2 runs after t1

This means t2 waits for t1 to finish before starting.
Result
Tasks run in the order you specify, not just by their order in code.
Knowing that dependencies control execution order prevents mistakes where tasks run too early or out of order.
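Dependencies can also fan out and fan back in, not just form a straight chain. A minimal sketch, assuming Airflow 2.x is installed (the DAG id and task ids are made up for illustration):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG('fan_out_example', start_date=datetime(2024, 1, 1),
         schedule_interval='@daily') as dag:
    extract = BashOperator(task_id='extract', bash_command='echo extract')
    clean = BashOperator(task_id='clean', bash_command='echo clean')
    validate = BashOperator(task_id='validate', bash_command='echo validate')
    load = BashOperator(task_id='load', bash_command='echo load')

    # extract fans out to clean and validate; load waits for both
    extract >> [clean, validate] >> load
```

Here `extract >> [clean, validate]` makes both downstream tasks wait for extract, and `[clean, validate] >> load` makes load wait until both have finished.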
4
Intermediate: Using default arguments for DAGs
🤔 Before reading on: do you think each task needs its own start date, or can the DAG share one? Commit to your answer.
Concept: Learn to use default_args to set common parameters for all tasks in a DAG.
Instead of repeating settings like start_date for every task, you can set them once in default_args and pass it to the DAG. This keeps your code clean and consistent.

Example:

default_args = {'start_date': datetime(2024, 1, 1), 'retries': 1}
dag = DAG('my_dag', default_args=default_args, schedule_interval='@daily')
Result
All tasks inherit these default settings unless overridden.
Using default_args reduces errors and makes managing many tasks easier.
5
Intermediate: Scheduling DAG runs
🤔
Concept: Understand how to control when and how often your DAG runs using schedule_interval.
The schedule_interval parameter tells Airflow how often to run the DAG. It can be a cron expression or a preset like '@daily' or '@hourly'.

Example:

dag = DAG('my_dag', default_args=default_args, schedule_interval='@daily')

This runs the DAG once every day at midnight.
Result
Your DAG runs automatically on the schedule you set.
Knowing scheduling lets you automate workflows without manual triggers.
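Presets like '@daily' cover common cases, but schedule_interval also accepts standard five-field cron expressions. A minimal sketch, assuming Airflow is installed (the DAG id and times are illustrative):

```python
from datetime import datetime

from airflow import DAG

# '0 6 * * 1-5' = minute 0, hour 6, any day of month, any month, Mon-Fri
dag = DAG(
    'weekday_morning_report',
    default_args={'start_date': datetime(2024, 1, 1)},
    schedule_interval='0 6 * * 1-5',  # 06:00 on weekdays
)
```

Note that Airflow triggers a run once its scheduled interval has ended, so the first run fires only after the first full interval past start_date has passed.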
6
Advanced: Task context and execution date
🤔 Before reading on: do you think tasks can know when they run, or only what commands to execute? Commit to your answer.
Concept: Learn that tasks receive context like execution date, which can be used inside commands or scripts.
Airflow passes context variables to tasks, such as execution_date, which tells when the task is running. You can use this in bash commands or Python code.

Example:

t1 = BashOperator(
    task_id='print_date',
    bash_command='echo {{ ds }}',  # ds is the execution date as a string
    dag=dag,
)
Result
Tasks can adapt their behavior based on when they run.
Using context makes workflows dynamic and flexible for different run times.
7
Expert: Avoiding common DAG pitfalls
🤔 Before reading on: do you think defining tasks inside or outside the DAG context affects execution? Commit to your answer.
Concept: Understand best practices to avoid errors like tasks running multiple times or DAG parsing issues.
Tasks should be defined inside the DAG context to avoid duplication. Avoid heavy computations or external calls at the top level of the DAG file, because Airflow parses DAG files often.

Example of bad practice:

import requests
response = requests.get('http://example.com')  # runs on every parse

Better:

def my_task():
    import requests
    return requests.get('http://example.com')  # runs only when the task executes

This keeps DAG parsing fast and stable.
Result
Your DAGs run reliably without slowing down the scheduler.
Knowing how Airflow parses DAG files prevents subtle bugs and performance problems.
Under the Hood
Airflow reads DAG files as Python scripts regularly to discover workflows. It builds a graph of tasks and dependencies in memory. The scheduler uses this graph to decide which tasks to run and when. Each task runs in its own process or container, reporting status back to Airflow's database.
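The scheduling decision described above can be sketched in plain Python, with no Airflow needed: a task becomes runnable once every task upstream of it has finished. The task names and dependency graph below are invented for illustration.

```python
# Minimal sketch of how a scheduler picks runnable tasks from a
# dependency graph: a task is ready once all its upstream tasks are done.

deps = {
    "extract": set(),            # no upstream tasks
    "transform": {"extract"},    # waits for extract
    "load": {"transform"},       # waits for transform
    "report": {"load"},          # waits for load
}

def run_order(deps):
    """Return one valid execution order (a topological sort)."""
    done, order = set(), []
    while len(done) < len(deps):
        ready = [t for t, up in deps.items() if t not in done and up <= done]
        if not ready:
            raise ValueError("cycle detected - not a DAG")
        for t in sorted(ready):  # deterministic order for this sketch
            order.append(t)
            done.add(t)
    return order

print(run_order(deps))  # ['extract', 'transform', 'load', 'report']
```

The cycle check is why the "acyclic" part of DAG matters: with a cycle, no task would ever become ready.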
Why designed this way?
Using Python for DAG files gives flexibility and power to define complex workflows easily. The DAG structure ensures no cycles, so tasks run in a clear order. Parsing DAG files often allows Airflow to detect changes quickly and schedule tasks accurately.
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│ DAG File (.py)│─────▶│ Scheduler     │─────▶│ Task Executor │
└───────────────┘      └───────────────┘      └───────────────┘
       │                      │                      │
       ▼                      ▼                      ▼
  Python code            Builds DAG graph         Runs tasks
  defines tasks          and dependencies        in order
Myth Busters - 4 Common Misconceptions
Quick: Does Airflow run tasks in the order they appear in the DAG file? Commit yes or no.
Common Belief:Tasks run in the order they are written in the DAG file.
Reality:Airflow runs tasks based on dependencies, not code order. Tasks without dependencies can run in any order or in parallel.
Why it matters:Assuming code order controls execution can cause unexpected task runs and failures.
Quick: Can you put heavy code or API calls at the top level of a DAG file? Commit yes or no.
Common Belief:You can run any Python code anywhere in the DAG file without issues.
Reality:Heavy or external calls at the top level run every time Airflow parses the DAG, slowing the scheduler and causing errors.
Why it matters:This can make Airflow slow or unstable, delaying all workflows.
Quick: Does setting schedule_interval to None mean the DAG never runs? Commit yes or no.
Common Belief: If schedule_interval is None, the DAG can never run at all.
Reality:A DAG with schedule_interval=None runs only when triggered manually, not on a schedule.
Why it matters:Misunderstanding this can cause workflows to never run unless manually started.
Quick: Can tasks share the same task_id in one DAG? Commit yes or no.
Common Belief:You can reuse task_ids in the same DAG for different tasks.
Reality:Task IDs must be unique within a DAG; duplicates cause errors.
Why it matters:Duplicate task IDs cause DAG parsing failures and prevent workflows from running.
Expert Zone
1
DAG files are parsed frequently by the scheduler and webserver, so keeping them lightweight improves system performance.
2
Using Jinja templating inside operators allows dynamic command generation based on execution context, enabling flexible workflows.
3
Task dependencies can be set using bitshift operators (>> and <<) or the set_upstream/set_downstream methods; choosing one style consistently improves readability.
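Point 2 above can be sketched as follows, assuming Airflow is installed; the built-in {{ ds_nodash }} variable renders to the run's date as YYYYMMDD just before the task executes (the DAG id and output path are made up):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG('templated_example', start_date=datetime(2024, 1, 1),
         schedule_interval='@daily') as dag:
    # Jinja is rendered at runtime, so each daily run gets its own
    # dated output path without any per-run code changes.
    export = BashOperator(
        task_id='export_partition',
        bash_command='echo "exporting to /tmp/out/{{ ds_nodash }}.csv"',
    )
```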
When NOT to use
For very simple one-off scripts or manual tasks, using Airflow DAGs may be overkill. Alternatives like cron jobs or simple scripts can be better. Also, for event-driven workflows, tools like Apache NiFi or AWS Step Functions might be more suitable.
Production Patterns
In production, DAGs often use modular Python code with reusable functions and custom operators. They include error handling with retries and alerts. DAGs are version-controlled and tested before deployment to avoid breaking pipelines.
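A minimal sketch of the retry and alert settings mentioned above, assuming Airflow is installed (the address and timings are placeholders, not recommendations):

```python
from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    'start_date': datetime(2024, 1, 1),
    'retries': 3,                         # re-run a failed task up to 3 times
    'retry_delay': timedelta(minutes=5),  # wait between attempts
    'email': ['alerts@example.com'],      # placeholder address
    'email_on_failure': True,             # alert when all retries are exhausted
}

dag = DAG('production_pipeline', default_args=default_args,
          schedule_interval='@daily')
```

Because these live in default_args, every task in the DAG inherits them unless it overrides them individually.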
Connections
Directed Acyclic Graphs (DAGs) in Computer Science
Airflow DAGs are a practical application of DAGs used to model task dependencies.
Understanding DAGs in theory helps grasp why Airflow enforces no cycles and how task order is determined.
Cron Scheduling
Airflow's schedule_interval builds on cron concepts to automate task timing.
Knowing cron syntax helps create precise schedules for DAG runs.
Project Management Task Dependencies
Task dependencies in Airflow mirror how project tasks depend on each other to finish in order.
Familiarity with project planning clarifies why tasks must wait for others before starting.
Common Pitfalls
#1Defining tasks outside the DAG context causing multiple task instances.
Wrong approach:

t1 = BashOperator(task_id='task1', bash_command='echo 1')  # no dag assigned
dag = DAG('dag1', start_date=datetime(2024, 1, 1))
t2 = BashOperator(task_id='task2', bash_command='echo 2', dag=dag)
t1 >> t2

Correct approach:

with DAG('dag1', start_date=datetime(2024, 1, 1)) as dag:
    t1 = BashOperator(task_id='task1', bash_command='echo 1')
    t2 = BashOperator(task_id='task2', bash_command='echo 2')
    t1 >> t2
Root cause:Tasks defined outside the DAG context are not linked properly, causing Airflow to treat them as separate or duplicate tasks.
#2Using heavy API calls at the top level slowing DAG parsing.
Wrong approach:

import requests
response = requests.get('http://api.example.com/data')  # runs on every parse

with DAG('dag', start_date=datetime(2024, 1, 1)) as dag:
    t1 = BashOperator(task_id='task', bash_command='echo done')

Correct approach:

def fetch_data():
    import requests
    return requests.get('http://api.example.com/data')  # runs only when a task calls it

with DAG('dag', start_date=datetime(2024, 1, 1)) as dag:
    t1 = BashOperator(task_id='task', bash_command='echo done')
Root cause:Top-level code runs every time Airflow parses the DAG, causing delays and possible failures.
#3Not setting task dependencies causing parallel runs.
Wrong approach:

with DAG('dag', start_date=datetime(2024, 1, 1)) as dag:
    t1 = BashOperator(task_id='task1', bash_command='echo 1')
    t2 = BashOperator(task_id='task2', bash_command='echo 2')

Correct approach:

with DAG('dag', start_date=datetime(2024, 1, 1)) as dag:
    t1 = BashOperator(task_id='task1', bash_command='echo 1')
    t2 = BashOperator(task_id='task2', bash_command='echo 2')
    t1 >> t2
Root cause: Without explicit dependencies, Airflow treats the tasks as independent and may run them in parallel, which causes errors if order matters.
Key Takeaways
A DAG file is a Python script that defines tasks and their order for Airflow to automate workflows.
Task dependencies control execution order, not the order of code lines in the DAG file.
Using default_args and schedule_interval simplifies DAG management and scheduling.
Avoid heavy code at the top level of DAG files to keep Airflow fast and stable.
Proper task definition inside the DAG context and clear dependencies prevent common errors.