Airflow · How-To · Beginner · 4 min read

How to Set Parallelism in Airflow for Efficient Task Execution

In Airflow, you set parallelism by configuring the parallelism option in the [core] section of airflow.cfg (the global limit) or by setting max_active_tasks on a DAG (the per-DAG limit). These settings control how many task instances can run at the same time globally or per DAG, helping you manage resource usage.
📐 Syntax

The main ways to set parallelism in Airflow are:

  • Global parallelism: Set parallelism in airflow.cfg to limit total concurrent tasks across all DAGs.
  • DAG-level parallelism: Use max_active_tasks or max_active_runs in the DAG definition to limit concurrency per DAG.
  • Task-level parallelism: Use pool and pool_slots to control task concurrency with resource pools.
# airflow.cfg -- global limit across all DAGs
[core]
parallelism = 32

# dag_file.py -- per-DAG limit
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # DummyOperator is deprecated in Airflow 2.x

default_args = {
    'start_date': datetime(2024, 1, 1),
}

dag = DAG(
    'example_dag',
    default_args=default_args,
    max_active_tasks=5,  # at most 5 of this DAG's tasks run concurrently
    schedule_interval='@daily'
)

task1 = EmptyOperator(task_id='task1', dag=dag)
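The pool and pool_slots options from the list above can be sketched as follows. This assumes Airflow 2.x, and 'heavy_pool' is a hypothetical pool name that would first have to be created in the Airflow UI or via the CLI before the task can run.

```python
# Sketch of task-level concurrency control with a resource pool.
# Assumes a pool named 'heavy_pool' with 4 slots already exists.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

dag = DAG(
    'pool_example',
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',
)

# Each instance of this task occupies 2 of the pool's slots, so with a
# 4-slot pool at most 2 such tasks run at the same time -- regardless of
# the global parallelism or the DAG's max_active_tasks.
heavy_task = BashOperator(
    task_id='heavy_task',
    bash_command='echo heavy work',
    pool='heavy_pool',   # hypothetical pool name
    pool_slots=2,
    dag=dag,
)
```

Pools are most useful for guarding a shared resource (a database, an API with rate limits) that several DAGs hit at once.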
💻 Example

This example shows how to set global parallelism in airflow.cfg and limit concurrent tasks in a DAG.

# airflow.cfg -- at most 10 tasks run at once across all DAGs
[core]
parallelism = 10

# dag_file.py -- at most 3 of this DAG's tasks run at once
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    'start_date': datetime(2024, 1, 1),
}

dag = DAG(
    'parallelism_example',
    default_args=default_args,
    max_active_tasks=3,
    schedule_interval='@daily'
)

# Five tasks are defined, but the scheduler runs at most 3 at a time
for i in range(5):
    BashOperator(
        task_id=f'task_{i}',
        bash_command='echo Running task',
        dag=dag
    )
Output
No direct output; the Airflow scheduler caps running tasks at 10 globally and at 3 for this DAG, so only 3 of the 5 tasks execute at any moment.
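The interaction between the two limits comes down to simple arithmetic: the number of a DAG's tasks actually running is bounded by the smaller of the two settings. A simplified model (it ignores pools, max_active_runs, and executor-specific limits):

```python
# Simplified model: a single DAG's effective concurrency is capped by
# both the global `parallelism` and the DAG's `max_active_tasks`.
def effective_dag_concurrency(parallelism: int, max_active_tasks: int) -> int:
    return min(parallelism, max_active_tasks)

# With the example settings above (parallelism=10, max_active_tasks=3):
print(effective_dag_concurrency(10, 3))  # 3 -- the DAG-level limit wins
print(effective_dag_concurrency(2, 5))   # 2 -- the global limit wins
```

This is why a generous max_active_tasks has no effect when the global parallelism is lower, as the pitfalls section below shows.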
⚠️ Common Pitfalls

Common mistakes when setting parallelism include:

  • Setting parallelism too high, causing resource exhaustion.
  • Confusing parallelism (global) with max_active_tasks (DAG-level).
  • Not using pools to manage resource-heavy tasks, leading to overload.
  • Forgetting to restart Airflow scheduler after changing airflow.cfg.
# Wrong: DAG-level limit exceeds the global limit
# airflow.cfg
[core]
parallelism = 2

# A DAG with max_active_tasks=5 is still capped at 2 by the global limit
dag = DAG('wrong_parallelism', max_active_tasks=5)

# Right: keep the global limit at or above the DAG-level limit
[core]
parallelism = 10

dag = DAG('correct_parallelism', max_active_tasks=5)
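Editing airflow.cfg is not the only option: any [core] setting can be overridden with an environment variable of the form AIRFLOW__SECTION__KEY, which Airflow reads at process start. A minimal sketch (the scheduler still needs a restart to pick the new value up):

```shell
# Override the global limit without touching airflow.cfg;
# AIRFLOW__CORE__PARALLELISM maps to [core] parallelism.
export AIRFLOW__CORE__PARALLELISM=10
echo "$AIRFLOW__CORE__PARALLELISM"
```

Environment-variable overrides take precedence over airflow.cfg, which makes them convenient for containerized deployments.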
📊 Quick Reference

Setting          | Description                             | Where to Configure
parallelism      | Max concurrent tasks globally           | airflow.cfg [core] section
max_active_tasks | Max concurrent tasks per DAG            | DAG definition parameter
max_active_runs  | Max DAG runs active at once             | DAG definition parameter
pool             | Limits concurrency for resource groups  | Airflow UI or CLI
pool_slots       | Number of slots a task consumes         | Task parameter
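max_active_runs from the table limits whole DAG runs rather than individual tasks, which matters when runs must not overlap (for example, a backfill over many past dates). A minimal sketch:

```python
# Sketch of run-level concurrency control (Airflow 2.x).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Even if a backfill creates many queued runs, at most 1 run of this
# DAG is active at a time; each run's tasks still respect the global
# parallelism and the DAG's max_active_tasks.
dag = DAG(
    'serialized_runs',
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',
    max_active_runs=1,
)

BashOperator(task_id='step', bash_command='echo step', dag=dag)
```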

Key Takeaways

  • Set global parallelism in airflow.cfg to control total concurrent tasks.
  • Use max_active_tasks in DAGs to limit concurrency per workflow.
  • Restart the Airflow scheduler after changing parallelism settings.
  • Use pools to manage resource-heavy task concurrency effectively.
  • Align global and DAG parallelism to avoid unexpected task limits.