How to Set Parallelism in Airflow for Efficient Task Execution
In Airflow, you set parallelism by configuring the parallelism parameter in the airflow.cfg file or by setting max_active_tasks in your DAG definition. The first caps how many tasks can run at the same time globally, the second caps it per DAG, and together they help manage resource usage.
Syntax
The main ways to set parallelism in Airflow are:
- Global parallelism: Set parallelism in airflow.cfg to limit total concurrent tasks across all DAGs.
- DAG-level parallelism: Use max_active_tasks or max_active_runs in the DAG definition to limit concurrency per DAG.
- Task-level parallelism: Use pool and pool_slots to control task concurrency with resource pools.
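```ini
[core]
parallelism = 32
```

```python
from datetime import datetime

from airflow import DAG
# Newer Airflow 2 releases deprecate DummyOperator in favor of
# EmptyOperator (airflow.operators.empty).
from airflow.operators.dummy import DummyOperator

default_args = {
    'start_date': datetime(2024, 1, 1),
}

# At most 5 task instances of this DAG run at the same time.
dag = DAG(
    'example_dag',
    default_args=default_args,
    max_active_tasks=5,
    schedule_interval='@daily',
)

task1 = DummyOperator(task_id='task1', dag=dag)
```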
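The snippet above covers the global and DAG-level settings but not pools. In a DAG, a task joins a pool through the operator's pool argument and declares its weight through pool_slots. The sketch below is a simplified pure-Python model of that idea, not Airflow's scheduler code: the function name admit_to_pool, the task names, and the greedy in-order admission are all assumptions made for illustration (the real scheduler also considers priority weights and other limits).

```python
def admit_to_pool(total_slots: int, tasks: list[tuple[str, int]]) -> tuple[list[str], list[str]]:
    """Greedily admit (task_id, pool_slots) pairs while the pool has free slots.

    Hypothetical helper: a toy model of pool accounting, not an Airflow API.
    """
    free = total_slots
    running, waiting = [], []
    for task_id, pool_slots in tasks:
        if pool_slots <= free:
            # The task fits: it starts and occupies its slots.
            running.append(task_id)
            free -= pool_slots
        else:
            # Not enough free slots: the task waits for a later pass.
            waiting.append(task_id)
    return running, waiting


# A pool with 3 slots and three tasks of different weight (hypothetical names).
running, waiting = admit_to_pool(3, [("heavy_a", 2), ("light", 1), ("heavy_b", 2)])
print(running, waiting)
```

In this model a heavier pool_slots value lets one resource-hungry task crowd out several lighter ones, which is the reason to size the pool and the per-task slots together.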
Example
This example shows how to set global parallelism in airflow.cfg and limit concurrent tasks in a DAG.
```ini
[core]
parallelism = 10
```

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    'start_date': datetime(2024, 1, 1),
}

# At most 3 task instances of this DAG run at the same time.
dag = DAG(
    'parallelism_example',
    default_args=default_args,
    max_active_tasks=3,
    schedule_interval='@daily',
)

# Five independent tasks, so all of them are eligible to run at once.
for i in range(5):
    BashOperator(
        task_id=f'task_{i}',
        bash_command='echo Running task',
        dag=dag,
    )
```
Output
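There is no direct output. With these settings the scheduler runs at most 10 tasks at once across all DAGs and at most 3 at once for parallelism_example, so two of the five tasks wait until a running task finishes.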
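To make the interaction of the two limits concrete, here is a minimal sketch in plain Python. It is a toy model rather than Airflow code: it assumes the DAG is the only one running, that all tasks are independent and take the same time, and the function names effective_limit and run_in_waves are hypothetical.

```python
def effective_limit(parallelism: int, max_active_tasks: int) -> int:
    """Concurrent-task ceiling for one DAG: the tighter of the two limits."""
    return min(parallelism, max_active_tasks)


def run_in_waves(n_tasks: int, limit: int) -> list[int]:
    """Split n_tasks independent, equal-length tasks into waves of at most `limit`."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    waves = []
    remaining = n_tasks
    while remaining > 0:
        batch = min(limit, remaining)
        waves.append(batch)
        remaining -= batch
    return waves


# Values from the example: parallelism = 10, max_active_tasks=3, five tasks.
limit = effective_limit(parallelism=10, max_active_tasks=3)
waves = run_in_waves(n_tasks=5, limit=limit)
print(limit, waves)  # prints: 3 [3, 2]
```

Under these assumptions the five tasks run as a group of three followed by a group of two. Raising parallelism further would not change that, because max_active_tasks is the tighter limit here.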
Common Pitfalls
Common mistakes when setting parallelism include:
- Setting parallelism too high, causing resource exhaustion.
- Confusing parallelism (global) with max_active_tasks (DAG-level).
- Not using pools to manage resource-heavy tasks, leading to overload.
- Forgetting to restart the Airflow scheduler after changing airflow.cfg.
```ini
# Wrong: setting parallelism in the DAG but ignoring the global limit
[core]
parallelism = 2
```

```python
from airflow import DAG

# max_active_tasks=5 is still capped by the global parallelism = 2
dag = DAG('wrong_parallelism', max_active_tasks=5)
```

```ini
# Right: align global and DAG parallelism
[core]
parallelism = 10
```

```python
from airflow import DAG

# The DAG limit of 5 fits under the global limit of 10
dag = DAG('correct_parallelism', max_active_tasks=5)
```
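The mismatch above can be caught before deployment with a small check. The following is a sketch using only the standard library, not an Airflow feature: the function name find_unreachable_limits and the dictionary of DAG limits are hypothetical, and the fallback of 32 assumes Airflow's default for parallelism when the option is absent.

```python
from configparser import ConfigParser

# Assumed default for [core] parallelism when airflow.cfg does not set it.
DEFAULT_PARALLELISM = 32


def find_unreachable_limits(cfg_text: str, dag_limits: dict[str, int]) -> list[str]:
    """Return DAG ids whose max_active_tasks exceeds the global parallelism.

    Hypothetical helper: dag_limits maps dag_id -> max_active_tasks.
    """
    # Disable interpolation so '%' characters elsewhere in the file are harmless.
    parser = ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    parallelism = parser.getint("core", "parallelism", fallback=DEFAULT_PARALLELISM)
    return [dag_id for dag_id, limit in dag_limits.items() if limit > parallelism]


cfg = "[core]\nparallelism = 2\n"
flagged = find_unreachable_limits(cfg, {"wrong_parallelism": 5, "small_dag": 2})
print(flagged)  # prints: ['wrong_parallelism']
```

A DAG flagged here is not broken; its max_active_tasks value simply can never be reached while the global limit stays lower.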
Quick Reference
| Setting | Description | Where to Configure |
|---|---|---|
| parallelism | Max concurrent tasks globally | airflow.cfg [core] section |
| max_active_tasks | Max concurrent tasks per DAG | DAG definition parameter |
| max_active_runs | Max DAG runs active at once | DAG definition parameter |
| pool | Limits concurrency for resource groups | Airflow UI or CLI |
| pool_slots | Number of slots a task consumes | Task parameter |
Key Takeaways
Set global parallelism in airflow.cfg to control total concurrent tasks.
Use max_active_tasks in DAGs to limit concurrency per workflow.
Restart Airflow scheduler after changing parallelism settings.
Use pools to manage resource-heavy task concurrency effectively.
Align global and DAG parallelism to avoid unexpected task limits.