Apache Airflow

Airflow architecture (scheduler, webserver, executor, metadata DB) - Commands & Configuration

Introduction
Airflow lets you run and manage workflows of tasks automatically. Its main components are the scheduler (decides when tasks should run), the webserver (shows status in a browser UI), the executor (actually runs the tasks), and the metadata database (records the state and history of every run).

  • When you want to run data processing jobs on a schedule without triggering them manually
  • When you need to see the progress and results of your tasks in a web browser
  • When you want to run many tasks in parallel efficiently
  • When you want to keep a record of all task runs and their status
  • When you want to retry failed tasks automatically
Config File - airflow.cfg
airflow.cfg
[core]
executor = LocalExecutor
load_examples = False

[database]
sql_alchemy_conn = sqlite:///airflow.db

[scheduler]
job_heartbeat_sec = 5
scheduler_heartbeat_sec = 5

[webserver]
web_server_port = 8080

This configuration file sets up Airflow's main parts:

  • executor: decides how tasks run; LocalExecutor runs them in parallel processes on one machine.
  • load_examples: disables loading of the bundled example DAGs.
  • sql_alchemy_conn: connection string for the metadata database; here a local SQLite file (fine for testing, not for production).
  • job_heartbeat_sec / scheduler_heartbeat_sec: how often jobs and the scheduler itself report heartbeats and check for work.
  • web_server_port: the port the web interface is served on.
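The same ini format can be read with Python's standard configparser, which is handy for checking what a deployment is actually configured with. A small sketch using two of the settings shown above, inlined as a string for the example:

```python
import configparser

# A fragment of airflow.cfg, inlined here so the example is self-contained
cfg_text = """
[core]
executor = LocalExecutor

[webserver]
web_server_port = 8080
"""

config = configparser.ConfigParser()
config.read_string(cfg_text)

executor = config.get("core", "executor")             # "LocalExecutor"
port = config.getint("webserver", "web_server_port")  # 8080
print(f"executor={executor}, port={port}")
```

Airflow itself exposes the same lookup on the command line via `airflow config get-value core executor`.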
Commands
This command creates the metadata database and tables to store task and workflow information.
Terminal
airflow db init
Expected Output
INFO [alembic.runtime.migration] Context impl SQLiteImpl.
INFO [alembic.runtime.migration] Will assume non-transactional DDL.
INFO [alembic.runtime.migration] Running upgrade -> head
INFO [alembic.runtime.migration] Upgrade done.
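To confirm that initialization actually created the metadata tables, the SQLite file can be inspected with Python's stdlib sqlite3 module. A sketch assuming the default `sqlite:///airflow.db` connection shown in the config above:

```python
import sqlite3

def list_tables(db_path):
    """Return the names of all tables in a SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]

# After 'airflow db init' you should see tables such as dag, dag_run,
# task_instance - the scheduler and webserver read and write these.
# print(list_tables("airflow.db"))
```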
Starts the scheduler, which looks for tasks that are due to run and hands them to the executor.
Terminal
airflow scheduler
Expected Output
[2024-06-01 12:00:00,000] {scheduler_job.py:123} INFO - Starting the scheduler
[2024-06-01 12:00:05,000] {scheduler_job.py:456} INFO - Scheduler heartbeat
Starts the webserver so you can open the Airflow UI in your browser on port 8080.
Terminal
airflow webserver -p 8080
Expected Output
[2024-06-01 12:00:00,000] {webserver.py:78} INFO - Starting web server on port 8080
[2024-06-01 12:00:00,100] {webserver.py:90} INFO - Webserver started
-p - Sets the port number for the webserver
Lists all tasks in the DAG named 'example_dag' to check what tasks are available to run.
Terminal
airflow tasks list example_dag
Expected Output
task_1
task_2
task_3
Key Concept

If you remember nothing else from this pattern, remember: Airflow uses the scheduler to plan tasks, the executor to run them, the webserver to show status, and the metadata database to keep track of everything.

Common Mistakes
Not initializing the metadata database before starting Airflow components
Airflow needs the database to store task info; without it, components fail to start or work properly
Always run 'airflow db init' ('airflow db migrate' in Airflow 2.7+) before starting the scheduler or webserver
Running the webserver on a port already in use
The webserver will fail to start and show an error if the port is busy
Choose a free port with '-p' flag or stop the process using the port
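Whether a port is free can be checked before launching the webserver. A stdlib sketch (8080 is just the document's example port):

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Try to bind the port; success means nothing is listening on it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

# if not port_is_free(8080):
#     print("Port 8080 is busy - pick another: airflow webserver -p 8081")
```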
Using SequentialExecutor in production expecting parallel task runs
SequentialExecutor runs tasks one at a time (it is the only executor that works with SQLite), so parallelism is not possible
Use LocalExecutor or CeleryExecutor for parallel task execution
Summary
Initialize the metadata database with 'airflow db init' to prepare Airflow for use.
Start the scheduler to plan and send tasks to run automatically.
Run the webserver to access the Airflow UI and monitor tasks.
Use the metadata database to keep track of task states and history.