Apache Airflow

Airflow architecture (scheduler, webserver, executor, metadata DB) - Commands & Configuration

Introduction
Airflow lets you run and manage workflows of tasks automatically. Its main components are the scheduler (decides when tasks should run), the webserver (shows status in a browser UI), the executor (actually runs the tasks), and the metadata database (records the state and history of every run).

  • When you want to run data processing jobs on a schedule without triggering them manually
  • When you need to see the progress and results of your tasks in a web browser
  • When you want to run many tasks in parallel efficiently
  • When you want to keep a record of all task runs and their status
  • When you want to retry failed tasks automatically
Config File - airflow.cfg
airflow.cfg
[core]
executor = LocalExecutor
load_examples = False

[database]
sql_alchemy_conn = sqlite:///airflow.db

[scheduler]
job_heartbeat_sec = 5
scheduler_heartbeat_sec = 5

[webserver]
web_server_port = 8080

This configuration file sets up Airflow's main parts:

  • executor: decides how tasks run; LocalExecutor runs them in parallel processes on one machine.
  • load_examples: disables loading of the bundled example DAGs.
  • sql_alchemy_conn: connection string for the metadata database; here a local SQLite file (fine for testing, not for production).
  • job_heartbeat_sec / scheduler_heartbeat_sec: how often jobs and the scheduler itself report heartbeats and check for work.
  • web_server_port: the port the web interface is served on.
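The same ini format can be read with Python's standard configparser, which is handy for checking what a deployment is actually configured with. A small sketch using two of the settings shown above, inlined as a string for the example:

```python
import configparser

# A fragment of airflow.cfg, inlined here so the example is self-contained
cfg_text = """
[core]
executor = LocalExecutor

[webserver]
web_server_port = 8080
"""

config = configparser.ConfigParser()
config.read_string(cfg_text)

executor = config.get("core", "executor")             # "LocalExecutor"
port = config.getint("webserver", "web_server_port")  # 8080
print(f"executor={executor}, port={port}")
```

Airflow itself exposes the same lookup on the command line via `airflow config get-value core executor`.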
Commands
This command creates the metadata database and tables to store task and workflow information.
Terminal
airflow db init
Expected Output
INFO [alembic.runtime.migration] Context impl SQLiteImpl.
INFO [alembic.runtime.migration] Will assume non-transactional DDL.
INFO [alembic.runtime.migration] Running upgrade -> head
INFO [alembic.runtime.migration] Upgrade done.
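To confirm that initialization actually created the metadata tables, the SQLite file can be inspected with Python's stdlib sqlite3 module. A sketch assuming the default `sqlite:///airflow.db` connection shown in the config above:

```python
import sqlite3

def list_tables(db_path):
    """Return the names of all tables in a SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]

# After 'airflow db init' you should see tables such as dag, dag_run,
# task_instance - the scheduler and webserver read and write these.
# print(list_tables("airflow.db"))
```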
Starts the scheduler, which looks for tasks that are due to run and hands them to the executor.
Terminal
airflow scheduler
Expected Output
[2024-06-01 12:00:00,000] {scheduler_job.py:123} INFO - Starting the scheduler
[2024-06-01 12:00:05,000] {scheduler_job.py:456} INFO - Scheduler heartbeat
Starts the webserver so you can open the Airflow UI in your browser on port 8080.
Terminal
airflow webserver -p 8080
Expected Output
[2024-06-01 12:00:00,000] {webserver.py:78} INFO - Starting web server on port 8080
[2024-06-01 12:00:00,100] {webserver.py:90} INFO - Webserver started
-p - Sets the port number for the webserver
Lists all tasks in the DAG named 'example_dag' to check what tasks are available to run.
Terminal
airflow tasks list example_dag
Expected Output
task_1
task_2
task_3
Key Concept

If you remember nothing else from this pattern, remember: Airflow uses the scheduler to plan tasks, the executor to run them, the webserver to show status, and the metadata database to keep track of everything.

Common Mistakes
Not initializing the metadata database before starting Airflow components
Airflow needs the database to store task info; without it, components fail to start or work properly
Always run 'airflow db init' ('airflow db migrate' in Airflow 2.7+) before starting the scheduler or webserver
Running the webserver on a port already in use
The webserver will fail to start and show an error if the port is busy
Choose a free port with '-p' flag or stop the process using the port
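Whether a port is free can be checked before launching the webserver. A stdlib sketch (8080 is just the document's example port):

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Try to bind the port; success means nothing is listening on it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

# if not port_is_free(8080):
#     print("Port 8080 is busy - pick another: airflow webserver -p 8081")
```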
Using SequentialExecutor in production expecting parallel task runs
SequentialExecutor runs tasks one at a time (it is the only executor that works with SQLite), so parallelism is not possible
Use LocalExecutor or CeleryExecutor for parallel task execution
Summary
Initialize the metadata database with 'airflow db init' to prepare Airflow for use.
Start the scheduler to plan and send tasks to run automatically.
Run the webserver to access the Airflow UI and monitor tasks.
Use the metadata database to keep track of task states and history.