Apache Airflow · DevOps · ~15 mins

Airflow architecture (scheduler, webserver, executor, metadata DB) - Deep Dive

Overview - Airflow architecture (scheduler, webserver, executor, metadata DB)
What is it?
Apache Airflow is a tool that helps you organize and run tasks automatically in a specific order. It has parts like the scheduler, webserver, executor, and metadata database that work together to manage and track these tasks. The scheduler decides when tasks should run, the executor runs them, the webserver shows you the status, and the metadata database stores all the information. This setup makes managing complex workflows easier and more visible.
Why it matters
Without Airflow's architecture, running many tasks in order would be chaotic and error-prone. People would have to manually start tasks and check if they finished, which wastes time and causes mistakes. Airflow automates this, making sure tasks run on time and you can see what’s happening. This saves effort, avoids errors, and helps teams deliver work faster and more reliably.
Where it fits
Before learning Airflow architecture, you should understand basic task automation and databases. After this, you can learn how to write workflows (DAGs) in Airflow and how to deploy Airflow in cloud or production environments. This topic is a foundation for mastering workflow orchestration.
Mental Model
Core Idea
Airflow architecture is a team of parts working together: the scheduler plans tasks, the executor runs them, the webserver shows progress, and the metadata database keeps records.
Think of it like...
Imagine a restaurant kitchen: the scheduler is the head chef who plans which dishes to cook and when, the executor is the cook who prepares the dishes, the webserver is the waiter who shows customers the order status, and the metadata database is the kitchen notebook where all orders and progress are recorded.
┌─────────────┐      ┌───────────────┐      ┌───────────────┐
│ Scheduler   │─────▶│ Executor      │─────▶│ Task Workers  │
└──────┬──────┘      └───────┬───────┘      └───────┬───────┘
       │                     │                      │
       ▼                     ▼                      ▼
┌─────────────────────────────────────────────────────────┐
│                      Metadata DB                        │
└─────────────────────────────────────────────────────────┘
       ▲
       │ (reads)
┌──────┴──────┐
│ Webserver   │
└─────────────┘
Build-Up - 7 Steps
Step 1 (Foundation): Understanding Airflow's Purpose
Concept: Airflow helps automate and manage tasks that need to run in a specific order.
Airflow lets you define workflows as code, so you can schedule tasks like data processing or sending emails automatically. It handles running these tasks on time and in the right sequence.
Result
You can automate complex task sequences without manual intervention.
Understanding Airflow’s purpose helps you see why its architecture needs different parts working together.
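The core idea of "workflows as code" can be sketched in plain Python, with no Airflow installed. This is only a toy illustration of tasks plus ordering constraints; real Airflow DAGs are written with the `DAG` and operator classes from the `airflow` package, and the task names here are made up.

```python
# Toy sketch (plain Python, no Airflow required): a workflow is just
# tasks plus ordering constraints, and the engine derives a valid run
# order from them.

# task -> set of tasks it depends on (hypothetical example tasks)
deps = {
    "extract": set(),
    "transform": {"extract"},
    "send_report": {"transform"},
}

def run_order(deps):
    """Return a valid execution order (a simple topological sort)."""
    done, order = set(), []
    while len(done) < len(deps):
        ready = [t for t, d in deps.items() if t not in done and d <= done]
        if not ready:
            raise ValueError("cycle detected")
        for t in sorted(ready):  # sorted() only to make output deterministic
            order.append(t)
            done.add(t)
    return order

print(run_order(deps))  # ['extract', 'transform', 'send_report']
```

Airflow does the same dependency resolution for you, plus scheduling, retries, and state tracking.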
Step 2 (Foundation): Meet the Metadata Database
Concept: The metadata database stores all information about tasks, workflows, and their states.
Airflow uses a database (like PostgreSQL or MySQL) to keep track of which tasks ran, when, and if they succeeded or failed. This database is the single source of truth for the system.
Result
Task history and workflow status are saved and can be queried anytime.
Knowing the metadata DB is central prevents confusion about where Airflow stores its data.
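The "single source of truth" idea can be sketched with SQLite from the standard library. Airflow's real metadata database (PostgreSQL/MySQL) has its own schema managed by the project; the table and column names below are illustrative only.

```python
import sqlite3

# Toy sketch of a metadata store: components write state changes here
# instead of talking to each other directly.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE task_instance (dag_id TEXT, task_id TEXT, state TEXT)")

db.execute("INSERT INTO task_instance VALUES ('etl', 'extract', 'success')")
db.execute("INSERT INTO task_instance VALUES ('etl', 'transform', 'failed')")

# Any component (scheduler, webserver, CLI) can query the same truth:
failed = db.execute(
    "SELECT task_id FROM task_instance WHERE state = 'failed'"
).fetchall()
print(failed)  # [('transform',)]
```

Because every component reads and writes the same store, a restarted scheduler or webserver picks up exactly where the system left off.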
Step 3 (Intermediate): Role of the Scheduler
🤔 Before reading on: do you think the scheduler runs tasks directly or just plans them? Commit to your answer.
Concept: The scheduler decides when tasks should start based on their dependencies and schedules.
The scheduler checks the metadata database for tasks that are ready, then hands them to the executor. It repeats this loop continuously to keep workflows moving.
Result
Tasks start running at the right time and in the correct order.
Understanding the scheduler’s planning role clarifies why it doesn’t execute tasks itself.
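A toy scheduler pass makes the planning-only role concrete: it inspects state, marks tasks as queued, and hands them off, but never executes task code itself. This is an illustrative sketch, not Airflow's internals.

```python
from collections import deque

# Toy scheduler pass: find tasks whose upstream dependencies have
# succeeded and enqueue them for an executor. No task code runs here.
deps = {"extract": set(), "transform": {"extract"}}
state = {"extract": "success", "transform": "scheduled"}
executor_queue = deque()

def scheduler_pass(deps, state, queue):
    for task, upstream in deps.items():
        ready = all(state[u] == "success" for u in upstream)
        if state[task] == "scheduled" and ready:
            state[task] = "queued"  # record the decision...
            queue.append(task)      # ...and hand off; execution happens elsewhere

scheduler_pass(deps, state, executor_queue)
print(list(executor_queue))  # ['transform']
```

In real Airflow this loop runs continuously, and the state it reads and writes lives in the metadata database rather than an in-memory dict.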
Step 4 (Intermediate): Executor Runs the Tasks
🤔 Before reading on: do you think the executor runs tasks locally or can it run them on other machines? Commit to your answer.
Concept: The executor is responsible for running the tasks, either locally or distributed across machines.
Airflow supports different executors like SequentialExecutor (runs tasks one by one), LocalExecutor (runs tasks in parallel on the same machine), and CeleryExecutor (runs tasks distributed on many workers). The executor receives tasks from the scheduler and runs them.
Result
Tasks actually get executed, possibly in parallel or distributed.
Knowing executors can run tasks in different ways helps you choose the right setup for your needs.
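The difference between one-at-a-time and parallel execution can be felt with a few lines of standard-library Python. This is only an analogy: real executors manage OS processes, Celery workers, or Kubernetes pods, not threads in one interpreter.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Toy contrast: SequentialExecutor-style (one at a time) versus
# LocalExecutor-style (parallel on one machine) execution.
def task(n):
    time.sleep(0.1)  # stand-in for real work
    return n * n

start = time.perf_counter()
sequential = [task(n) for n in range(4)]         # one at a time
seq_time = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:  # in parallel
    parallel = list(pool.map(task, range(4)))
par_time = time.perf_counter() - start

print(sequential == parallel)  # True — same results, less wall time
```

Same results either way; the executor choice changes how long you wait and what infrastructure you need.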
Step 5 (Intermediate): Webserver Shows Workflow Status
Concept: The webserver provides a user interface to monitor and control workflows.
Airflow’s webserver reads from the metadata database to show task statuses, logs, and workflow graphs. It lets users trigger tasks manually or pause workflows.
Result
Users can see what’s happening and interact with workflows easily.
Recognizing the webserver as the user’s window into Airflow helps understand its separation from task execution.
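The webserver's read-only role can be sketched as a function that renders recorded state without ever running anything. The real UI is a web application backed by the metadata database; the names below are illustrative.

```python
# Toy "webserver" sketch: it only reads recorded state and renders it.
state = {"extract": "success", "transform": "failed", "load": "queued"}

def render_status(state):
    """Render a status view; no task execution happens here."""
    lines = [f"{task:<10} {s}" for task, s in sorted(state.items())]
    return "\n".join(lines)

print(render_status(state))
```

If a task fails, the webserver shows the failure, but fixing or rerunning it is ultimately the scheduler's and executor's job.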
Step 6 (Advanced): How Components Communicate
🤔 Before reading on: do you think components communicate directly or through the metadata database? Commit to your answer.
Concept: Airflow components mainly communicate via the metadata database and message queues.
The scheduler writes task states to the metadata DB and sends tasks to the executor. Executors update task status back to the metadata DB. The webserver reads from the metadata DB to display info. In distributed setups, message brokers like RabbitMQ or Redis help executors receive tasks.
Result
Components stay loosely coupled and scalable.
Understanding communication paths explains how Airflow scales and stays reliable.
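The broker-style handoff can be sketched with a standard-library queue: the "scheduler" side pushes task IDs and never calls the worker directly, which is the same loose coupling a broker like Redis or RabbitMQ provides in a CeleryExecutor setup. Illustrative only, not Airflow code.

```python
import queue
import threading

broker = queue.Queue()
results = {}

def worker():
    # "Worker" side: pull task IDs off the broker and record outcomes.
    while True:
        task_id = broker.get()
        if task_id is None:           # shutdown signal
            break
        results[task_id] = "success"  # report state back (here: a dict;
                                      # in Airflow: the metadata DB)

t = threading.Thread(target=worker)
t.start()
for task_id in ["extract", "transform"]:
    broker.put(task_id)               # "scheduler" side: fire and forget
broker.put(None)
t.join()
print(results)  # {'extract': 'success', 'transform': 'success'}
```

Neither side blocks on the other or knows the other's address, which is why workers can be added or removed without touching the scheduler.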
Step 7 (Expert): Executor Types and Scalability Tradeoffs
🤔 Before reading on: do you think using CeleryExecutor always improves performance? Commit to your answer.
Concept: Different executors offer tradeoffs between simplicity, scalability, and complexity.
SequentialExecutor is simple but slow, good for testing. LocalExecutor allows parallelism on one machine. CeleryExecutor enables distributed task execution across many workers but requires extra setup like message brokers. KubernetesExecutor runs tasks as pods in Kubernetes clusters for cloud-native scaling. Choosing the right executor depends on workload size and infrastructure.
Result
You can scale Airflow workflows efficiently by selecting the right executor.
Knowing executor tradeoffs prevents costly mistakes in production setups.
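The executor is selected in Airflow's configuration (in `airflow.cfg`, or via the `AIRFLOW__CORE__EXECUTOR` environment variable). A minimal fragment, for example to move from the default to local parallelism:

```ini
[core]
# SequentialExecutor (the default, paired with SQLite) runs one task at
# a time; LocalExecutor needs a real database (e.g. PostgreSQL) but runs
# tasks in parallel on one machine.
executor = LocalExecutor
```

Switching to CeleryExecutor or KubernetesExecutor is the same one-line change, but each pulls in extra infrastructure (a message broker, a Kubernetes cluster) that must also be configured.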
Under the Hood
Airflow’s scheduler continuously queries the metadata database to find tasks ready to run based on schedules and dependencies. It then sends these tasks to the executor, which runs them either locally or on worker machines. Task states and logs are updated back into the metadata database. The webserver queries this database to provide real-time status and control. Executors may use message brokers to receive tasks asynchronously in distributed setups.
Why designed this way?
This design separates concerns: scheduling, execution, and user interface are independent, making the system modular and scalable. Using a metadata database as the central communication point ensures consistency and fault tolerance. Message brokers enable distributed execution without tight coupling. Alternatives like monolithic designs were less flexible and harder to scale.
┌─────────────┐       ┌───────────────┐       ┌───────────────┐
│ Scheduler   │──────▶│ Executor      │──────▶│ Task Workers  │
│ (plans)     │       │ (runs tasks)  │       │ (execute code)│
└─────┬───────┘       └─────┬─────────┘       └─────┬─────────┘
      │                     │                      │
      │                     │                      │
      ▼                     ▼                      ▼
┌─────────────────────────────────────────────────────────┐
│                   Metadata Database                      │
│  (stores task states, schedules, logs, workflow info)   │
└─────────────────────────────────────────────────────────┘
      ▲
      │
┌─────┴───────┐
│ Webserver   │
│ (UI & API)  │
└─────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does the scheduler execute tasks directly or only plan them? Commit to your answer.
Common Belief: The scheduler runs the tasks itself since it decides when they should start.
Reality: The scheduler only plans and sends tasks to the executor; it does not run tasks itself.
Why it matters: Believing the scheduler runs tasks can lead to confusion when tasks don’t execute, causing wasted troubleshooting time.
Quick: Is the webserver responsible for running tasks? Commit to your answer.
Common Belief: The webserver runs tasks because it shows task status and logs.
Reality: The webserver only displays information from the metadata database; it does not execute tasks.
Why it matters: Misunderstanding this can cause incorrect assumptions about system load and where to troubleshoot task failures.
Quick: Does Airflow store task logs in the metadata database? Commit to your answer.
Common Belief: All logs are stored in the metadata database for easy access.
Reality: Task logs are usually stored separately on disk or in remote storage; the metadata database stores task states and metadata only.
Why it matters: Expecting logs in the database can cause confusion when logs are missing or hard to find.
Quick: Does using CeleryExecutor always guarantee better performance? Commit to your answer.
Common Belief: CeleryExecutor always improves performance because it runs tasks on many workers.
Reality: CeleryExecutor adds complexity and overhead; for small workloads, simpler executors may perform better.
Why it matters: Choosing CeleryExecutor without need can waste resources and complicate maintenance.
Expert Zone
1. The scheduler’s heartbeat interval affects how quickly tasks start after becoming ready, balancing responsiveness and load.
2. Executors like KubernetesExecutor dynamically create pods per task, which adds startup latency but improves isolation and scalability.
3. The metadata database can become a bottleneck; tuning connection pools and using read replicas improves performance in large deployments.
When NOT to use
Airflow is not ideal for real-time or low-latency task execution; alternatives like Apache Kafka or specialized stream processors should be used instead. For simple cron jobs, native OS schedulers may be simpler and more efficient.
Production Patterns
In production, teams often use CeleryExecutor with Redis or RabbitMQ for distributed execution, combined with a PostgreSQL metadata database and a webserver behind a load balancer. They monitor scheduler health and tune executor concurrency to balance throughput and resource use.
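A configuration fragment for such a setup might look roughly like the following. The host names and credentials are placeholders, and note that the metadata DB connection moved from `[core]` to the `[database]` section in Airflow 2.3; check the configuration reference for your version.

```ini
[core]
executor = CeleryExecutor

[database]
# Metadata DB connection (placeholder host/credentials)
sql_alchemy_conn = postgresql+psycopg2://airflow:***@db-host/airflow

[celery]
broker_url = redis://redis-host:6379/0
result_backend = db+postgresql://airflow:***@db-host/airflow
```

The broker carries task messages to workers, while the result backend and metadata DB keep the authoritative task state.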
Connections
Distributed Systems
Airflow’s executor and scheduler model builds on distributed system principles of decoupling and asynchronous task execution.
Understanding distributed systems helps grasp why Airflow separates scheduling and execution and uses message brokers.
Database Transaction Logs
The metadata database acts like a transaction log, recording every task state change reliably.
Knowing how transaction logs ensure consistency helps understand Airflow’s fault tolerance and recovery.
Restaurant Kitchen Workflow
Airflow’s architecture mirrors how a kitchen organizes orders, cooks, and serves dishes efficiently.
Seeing Airflow as a kitchen workflow clarifies the roles of scheduler, executor, and webserver in managing tasks.
Common Pitfalls
#1 Trying to run tasks directly from the scheduler process.
Wrong approach: Starting tasks inside the scheduler code or expecting it to execute tasks inline.
Correct approach: Let the scheduler send tasks to the executor, which runs them separately.
Root cause: Misunderstanding the scheduler’s role as planner, not executor.
#2 Ignoring metadata database performance tuning.
Wrong approach: Using default database settings without connection pooling or indexing.
Correct approach: Configure database connection pools and indexes, and consider read replicas for scaling.
Root cause: Underestimating the metadata database as a critical performance component.
#3 Using SequentialExecutor in production for heavy workloads.
Wrong approach: Setting executor = SequentialExecutor in airflow.cfg for large task volumes.
Correct approach: Use CeleryExecutor or KubernetesExecutor for parallel and distributed execution.
Root cause: Not matching executor choice to workload scale.
Key Takeaways
Airflow’s architecture divides responsibilities: scheduler plans tasks, executor runs them, webserver shows status, and metadata database stores all info.
The metadata database is the central communication hub, ensuring all components stay in sync and track task states reliably.
Choosing the right executor is key to scaling Airflow workflows efficiently and depends on workload size and infrastructure.
Misunderstanding component roles leads to common errors; knowing each part’s function helps troubleshoot and optimize Airflow.
Airflow’s modular design supports flexibility, scalability, and fault tolerance, making it powerful for managing complex automated workflows.