Orchestrating dbt with Airflow - Time & Space Complexity
When using Airflow to orchestrate dbt tasks, it's important to understand how the total runtime grows as you add more dbt models or steps. In other words: how does the orchestration time change as the number of dbt tasks increases?
Let's analyze the time complexity of this Airflow DAG, which runs dbt models sequentially.
```python
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

def create_dbt_task(model_name, dag):
    """Create one BashOperator that runs a single dbt model."""
    return BashOperator(
        task_id=f'dbt_run_{model_name}',
        bash_command=f'dbt run --models {model_name}',
        dag=dag,
    )

dag = DAG('dbt_sequential', start_date=datetime(2024, 1, 1))
models = ['model1', 'model2', 'model3', 'model4']

# Chain the tasks so each model waits for the previous one to finish.
previous_task = None
for model in models:
    task = create_dbt_task(model, dag)
    if previous_task:
        previous_task >> task
    previous_task = task
```
This code runs the dbt models one after another, waiting for each to finish before starting the next. To analyze the complexity, look at what repeats as the input grows:
- Primary operation: Running each dbt model as a separate Airflow task.
- How many times: Once per model, so the number of tasks equals the number of models.
As you add more dbt models, the total time grows because each model runs one after another.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 dbt runs, one after another |
| 100 | 100 dbt runs, sequentially |
| 1000 | 1000 dbt runs, sequentially |
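The table's pattern can be sketched with a rough timing model in plain Python (no Airflow required). The per-model runtime of 2 minutes here is an arbitrary assumption for illustration; the point is that the total is the *sum* of the individual runtimes, so it scales linearly with n.

```python
def sequential_runtime(model_runtimes):
    """Total wall-clock time when models run one after another:
    the sum of the individual model runtimes."""
    return sum(model_runtimes)

# Assume each model takes ~2 minutes; 10x the models -> 10x the time.
for n in (10, 100, 1000):
    print(n, sequential_runtime([2] * n))  # 20, 200, 2000 minutes
```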
Pattern observation: The total time grows roughly in direct proportion to the number of models.
Time Complexity: O(n)
This means the total orchestration time grows linearly as you add more dbt models to run one after another.
[X] Wrong: "Running multiple dbt models in Airflow always happens at the same time, so time stays the same no matter how many models."
[OK] Correct: In this setup, models run one after another, so adding more models adds more total time.
Understanding how task orchestration time grows helps you design workflows that scale well and keep pipelines efficient.
What if we changed the Airflow DAG to run all dbt models in parallel instead of sequentially? How would the time complexity change?
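One way to reason about this: if the `previous_task >> task` chaining is removed from the loop, the tasks have no dependencies, and Airflow is free to schedule them concurrently. Assuming unlimited worker slots and ignoring scheduler overhead (real deployments cap concurrency via parallelism and pool settings), the wall-clock time is then bounded by the slowest single model rather than the sum of all of them. A minimal sketch of that comparison, again as a plain-Python timing model:

```python
def sequential_runtime(model_runtimes):
    """Chained tasks: total time is the sum of all runtimes."""
    return sum(model_runtimes)

def parallel_runtime(model_runtimes):
    """Independent tasks with unlimited workers: the DAG finishes
    when the slowest model finishes."""
    return max(model_runtimes)

runtimes = [5, 5, 5, 5]  # four models, ~5 minutes each (assumed)
print(sequential_runtime(runtimes))  # 20 minutes
print(parallel_runtime(runtimes))    # 5 minutes
```

With a fixed pool of w workers, the parallel variant still grows with n, but as roughly O(n/w) rather than O(n).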