Sharing data between tasks effectively in Apache Airflow - Time & Space Complexity
When tasks share data in Airflow, the way data is passed affects how long the workflow takes to run.
We want to know how the time to share data grows as the amount of data or number of tasks increases.
Analyze the time complexity of the following code snippet.
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def push_data(ti):
    # Build a list of n = 1000 integers and store it as an XCom value.
    data = list(range(1000))
    ti.xcom_push(key='my_data', value=data)

def pull_data(ti):
    # Fetch the list that push_task stored under the same key.
    data = ti.xcom_pull(key='my_data', task_ids='push_task')
    print(len(data))

dag = DAG('example_dag', start_date=datetime(2024, 1, 1))
push_task = PythonOperator(task_id='push_task', python_callable=push_data, dag=dag)
pull_task = PythonOperator(task_id='pull_task', python_callable=pull_data, dag=dag)

# pull_task runs only after push_task has finished.
push_task >> pull_task
This code pushes a list of 1000 numbers from one task and pulls it in another using XComs.
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Creating and transferring a list of data items between tasks.
- How many times: Once per item. Each of the n list items is created, serialized on push, and deserialized on pull; here n = 1000.
As the data size grows, the time to push and pull data grows roughly in proportion to the number of items.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | Handles 10 data items |
| 100 | Handles 100 data items |
| 1000 | Handles 1000 data items |
Pattern observation: Doubling the data roughly doubles the work to share it.
Time Complexity: O(n)
This means the time to share data grows linearly with the amount of data passed between tasks.
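The linear growth can be seen in the size of the payload itself. The sketch below is only an approximation of what happens on push: it assumes the value is encoded as JSON and uses Python's standard `json` module as a stand-in for Airflow's own XCom serialization. The helper name `xcom_payload_size` is hypothetical and not part of Airflow.

```python
import json


def xcom_payload_size(n: int) -> int:
    """Return the byte length of a JSON-encoded list of n integers.

    Assumption: JSON encoding stands in for the XCom serialization step.
    """
    data = list(range(n))
    return len(json.dumps(data).encode("utf-8"))


# Compare payload sizes for the input sizes used in the table above.
sizes = {n: xcom_payload_size(n) for n in (10, 100, 1000)}
for n, size in sizes.items():
    print(f"n={n:>5}  serialized bytes={size}")
```

Each tenfold increase in n produces a payload roughly ten times larger (slightly more, because larger integers need more digits), which is the linear trend the analysis describes.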
[X] Wrong: "Sharing data between tasks is always instant and does not depend on data size."
[OK] Correct: Larger data takes more time to serialize, transfer, and deserialize, so sharing it takes longer.
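To check this on your own machine, you can time the serialize and deserialize steps for different sizes. This is a rough sketch, not a benchmark of Airflow: it assumes JSON encoding via the standard `json` module, it does not model the transfer to and from the metadata database, and `timed_round_trip` is a hypothetical helper name.

```python
import json
import time


def timed_round_trip(n: int) -> dict:
    """Time a JSON encode and decode of n integers (transfer is not modelled)."""
    data = list(range(n))

    start = time.perf_counter()
    payload = json.dumps(data)        # serialize, as on the push side
    middle = time.perf_counter()
    restored = json.loads(payload)    # deserialize, as on the pull side
    end = time.perf_counter()

    return {
        "n": n,
        "serialize_s": middle - start,
        "deserialize_s": end - middle,
        "intact": restored == data,   # the data survives the round trip
    }


for n in (1_000, 10_000, 100_000):
    print(timed_round_trip(n))
```

The exact timings vary between machines and runs, but both steps should take noticeably longer as n grows.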
Understanding how data sharing scales helps you design workflows that run smoothly and predictably as they grow.
"What if we changed from passing data via XCom to using an external database? How would the time complexity change?"
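One way to start thinking about this question is to pass a reference to the data instead of the data itself. The sketch below uses a temporary JSON file as a stand-in for external storage; that substitution, the file name `my_data.json`, and the helper names `push_reference` and `pull_by_reference` are all assumptions made for illustration, not Airflow APIs.

```python
import json
import os
import tempfile


def push_reference(n: int, directory: str) -> str:
    """Write n integers to external storage and return only the path."""
    path = os.path.join(directory, "my_data.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(list(range(n)), f)
    return path  # this short string is what would be shared between tasks


def pull_by_reference(path: str) -> list:
    """Load the full data set from the location named by the reference."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    for n in (10, 1000):
        ref = push_reference(n, tmp)
        data = pull_by_reference(ref)
        print(f"n={n:>5}  reference length={len(ref)}  items loaded={len(data)}")
```

In this sketch the value shared between the tasks stays the same size no matter how large n is, but writing and reading the data still touches every item, so the total work remains proportional to n.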