
Database backend optimization in Apache Airflow - Commands & Configuration

Introduction
Airflow uses a database to keep track of tasks and workflows. Optimizing this database backend helps Airflow run faster and handle more work without slowing down.
When Airflow tasks are running slower than expected due to database delays
When the Airflow web UI takes too long to load task or DAG information
When you want to reduce database load during peak workflow execution times
When scaling Airflow to support more users and workflows without performance drops
When cleaning up old task records to keep the database size manageable
Config File - airflow.cfg
[core]
executor = LocalExecutor

[database]
# In Airflow 2.3+, connection settings live under [database];
# older releases read sql_alchemy_conn from [core].
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow
sql_alchemy_pool_size = 10
sql_alchemy_max_overflow = 20
sql_alchemy_pool_recycle = 1800

[scheduler]
# Renamed from max_threads in Airflow 2.
parsing_processes = 4

[logging]
logging_level = INFO

[metrics]
statsd_on = True

The sql_alchemy_conn setting is the SQLAlchemy connection string for Airflow's metadata database; here it points at a local PostgreSQL instance through the psycopg2 driver.

The sql_alchemy_pool_size setting controls how many database connections each Airflow component keeps open for reuse, and sql_alchemy_max_overflow allows extra temporary connections to be opened under load.

The sql_alchemy_pool_recycle helps avoid stale connections by recycling them every 1800 seconds.

The scheduler's parsing_processes option (called max_threads in Airflow 1.x) controls how many DAG files the scheduler parses in parallel; higher values increase scheduling throughput but also database load.

Enabling StatsD metrics with statsd_on lets you monitor scheduler and database performance over time.
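To see how the pool settings above interact, here is a minimal sketch (plain Python, standard library only, values copied from the example config) that reads the [database] section the same way a config parser would and computes the hard ceiling on connections one component can open:

```python
import configparser

# Excerpt mirroring the [database] section of the example airflow.cfg above.
cfg = configparser.ConfigParser()
cfg.read_string("""
[database]
sql_alchemy_pool_size = 10
sql_alchemy_max_overflow = 20
sql_alchemy_pool_recycle = 1800
""")

pool_size = cfg.getint("database", "sql_alchemy_pool_size")
max_overflow = cfg.getint("database", "sql_alchemy_max_overflow")

# The pool keeps up to pool_size connections open; under load it may
# temporarily open up to max_overflow more, so the hard ceiling is:
print(pool_size + max_overflow)  # 30
```

The ceiling matters because every scheduler, webserver, and worker process gets its own pool, so these numbers multiply across your deployment.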

Commands
This command upgrades the Airflow database schema to the latest version (renamed to 'airflow db migrate' in Airflow 2.7) to pick up compatibility and performance improvements.
Terminal
airflow db upgrade
Expected Output
INFO [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO [alembic.runtime.migration] Will assume transactional DDL.
INFO [alembic.runtime.migration] Running upgrade -> 2.5.0, Add new columns and indexes
INFO [alembic.runtime.migration] Running upgrade 2.5.0 -> 2.6.0, Optimize task instance queries
Upgrade successful
This command verifies that Airflow can connect to the metadata database, a quick first check when diagnosing performance issues.
Terminal
airflow db check
Expected Output
INFO [airflow.utils.db] Database connection successful
INFO [airflow.utils.db] Database schema is up to date
This command deletes records older than the specified timestamp from the metadata tables (task instances, logs, and so on) to reduce database size and improve query performance.
Terminal
airflow db clean --clean-before-timestamp 2023-01-01T00:00:00
Expected Output
INFO [airflow.utils.db] Deleted 1500 old task instance records before 2023-01-01T00:00:00
Cleanup complete
--clean-before-timestamp - Specify the cutoff date for cleaning old records
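Rather than hardcoding a cutoff date, you can derive a rolling retention window. The sketch below (plain Python; the 90-day window and the use of the --yes flag, which skips the interactive confirmation prompt, are assumptions to adapt to your retention policy) builds the cleanup command for "everything older than 90 days":

```python
from datetime import datetime, timedelta, timezone

# Assumed retention policy: keep the last 90 days of metadata.
RETENTION_DAYS = 90

cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)
timestamp = cutoff.strftime("%Y-%m-%dT%H:%M:%S")

# --yes skips the confirmation prompt, useful when run from cron.
command = f"airflow db clean --clean-before-timestamp {timestamp} --yes"
print(command)
```

A script like this can be scheduled (via cron or even a maintenance DAG) so cleanup happens regularly instead of as a one-off.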
This runs the Airflow scheduler once to test if the database connection and configuration are optimized for task scheduling.
Terminal
airflow scheduler --num-runs 1
Expected Output
INFO [airflow.scheduler] Starting scheduler
INFO [airflow.scheduler] Scheduler ran 1 iteration(s)
INFO [airflow.scheduler] Scheduler exiting
--num-runs - Limit the scheduler to a fixed number of iterations for testing
Key Concept

If you remember nothing else from database backend optimization, remember: tuning connection pooling and cleaning old data keeps Airflow fast and stable.

Common Mistakes
Not upgrading the database schema after Airflow updates
Old schemas can cause slow queries and errors because new features expect updated tables and indexes
Always run 'airflow db upgrade' after upgrading Airflow to keep the database schema current
Setting the connection pool size too low or too high without testing
Too low causes slow task scheduling due to waiting for connections; too high wastes resources and can overload the database
Start with moderate pool sizes like 10 and 20 overflow, then adjust based on monitoring metrics
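As a back-of-envelope sizing check, multiply the per-component ceiling by the number of components that talk to the metadata database and compare it with the database's connection limit. The sketch below uses assumed values: three components (scheduler, webserver, triggerer) and the PostgreSQL default max_connections of 100; substitute the numbers from your own deployment.

```python
# Assumed values: adjust to your deployment and database configuration.
pool_size = 10        # sql_alchemy_pool_size
max_overflow = 20     # sql_alchemy_max_overflow
components = 3        # e.g. scheduler, webserver, triggerer

# Each component gets its own pool, so the worst case across the
# deployment is the per-component ceiling times the component count.
worst_case = components * (pool_size + max_overflow)
postgres_max_connections = 100  # PostgreSQL default

print(worst_case)                                # 90
print(worst_case < postgres_max_connections)     # True: fits under the limit
```

If the worst case exceeds the limit, either lower the pool settings or raise max_connections (or add a connection pooler such as PgBouncer) before scaling out further.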
Ignoring old task records buildup in the database
Large numbers of old records slow down queries and increase storage needs
Regularly run 'airflow db clean' with appropriate timestamps to remove old data
Summary
Run 'airflow db upgrade' to keep the database schema updated for performance.
Configure connection pooling in airflow.cfg to balance speed and resource use.
Clean old task records regularly to keep the database size manageable and queries fast.
Test scheduler runs to verify database optimizations improve task scheduling speed.