Sharing data between tasks effectively in Apache Airflow - Time & Space Complexity

Understanding Time Complexity

When tasks share data in Airflow, the way data is passed affects how long the workflow takes to run.

We want to know how the time to share data grows as the amount of data or number of tasks increases.

Scenario Under Consideration

Analyze the time complexity of the following code snippet.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def push_data(ti):
    # Build a list of 1000 integers and store it as an XCom under 'my_data'.
    data = list(range(1000))
    ti.xcom_push(key='my_data', value=data)

def pull_data(ti):
    # Retrieve the list that push_task stored and report its length.
    data = ti.xcom_pull(key='my_data', task_ids='push_task')
    print(len(data))

dag = DAG('example_dag', start_date=datetime(2024, 1, 1))

push_task = PythonOperator(task_id='push_task', python_callable=push_data, dag=dag)
pull_task = PythonOperator(task_id='pull_task', python_callable=pull_data, dag=dag)

# pull_task runs only after push_task has finished.
push_task >> pull_task

This code pushes a list of 1000 numbers from one task and pulls it in another using XComs.
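
To make the per-item work visible without running Airflow, here is a minimal sketch of the push and pull steps. It is a simplified model, not Airflow's actual implementation: the in-memory dictionary `xcom_store` and the two helper functions are hypothetical stand-ins, and JSON serialization is an assumption.

```python
import json

# Hypothetical stand-in for the XCom backend (Airflow uses its metadata database by default).
xcom_store = {}

def xcom_push(task_id, key, value):
    """Serialize the value and store it under (task_id, key)."""
    xcom_store[(task_id, key)] = json.dumps(value)  # visits every item once

def xcom_pull(task_id, key):
    """Load the stored string and rebuild the Python object."""
    return json.loads(xcom_store[(task_id, key)])  # visits every item once

xcom_push("push_task", "my_data", list(range(1000)))
data = xcom_pull("push_task", "my_data")
print(len(data))  # prints 1000
```

Both helpers walk the whole list, which is where the dependence on data size comes from.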

Identify Repeating Operations

Identify the loops, recursion, and array traversals that repeat.

  • Primary operation: Creating the list, serializing it when it is pushed to XCom, and deserializing it when it is pulled. Each step touches every item.
  • How many times: Once per item in each step, so the work scales with the list size; here 1000 items.

How Execution Grows With Input

As the data size grows, the time to push and pull data grows roughly in proportion to the number of items.

Input Size (n)    Approx. Operations
10                Handles 10 data items
100               Handles 100 data items
1000              Handles 1000 data items

Pattern observation: Doubling the data roughly doubles the work to share it.
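
One way to check this pattern is to measure how large the serialized payload becomes for each input size. The sketch below assumes the data is serialized as JSON; `serialized_size` is a hypothetical helper written for this illustration, not part of Airflow.

```python
import json

def serialized_size(n):
    """Return the number of characters in the JSON form of a list of n integers."""
    return len(json.dumps(list(range(n))))

if __name__ == "__main__":
    # Print the payload size for the input sizes used in the table above.
    for n in (10, 100, 1000):
        print(n, serialized_size(n))
```

The payload grows roughly in proportion to n. It grows slightly faster than the item count because larger integers need more digits.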

Final Time Complexity

Time Complexity: O(n)

This means the time to share data grows linearly with the amount of data passed between tasks.

Common Mistake

[X] Wrong: "Sharing data between tasks is always instant and does not depend on data size."

[OK] Correct: Actually, larger data means more time to serialize, transfer, and deserialize, so it takes longer.
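
A rough way to see this for yourself is to time a serialize and deserialize round trip at different sizes. This is only a sketch under the assumption of JSON serialization: it leaves out the database write and read that a real XCom involves, `round_trip_seconds` is a hypothetical helper, and the timings will vary by machine.

```python
import json
import time

def round_trip_seconds(n, repeats=5):
    """Return the best observed time to serialize and deserialize a list of n integers."""
    data = list(range(n))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        restored = json.loads(json.dumps(data))
        best = min(best, time.perf_counter() - start)
    # Sanity check: the round trip must return the same data.
    assert restored == data
    return best

if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        print(n, round_trip_seconds(n))
```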

Interview Connect

Understanding how data sharing scales helps you design workflows that run smoothly and predictably as they grow.

Self-Check

"What if we changed from passing data via XCom to using an external database? How would the time complexity change?"