XCom vs External Storage in Airflow: Key Differences and Usage Guide
Use XCom in Airflow for small, simple data sharing between tasks within the same DAG run. Choose external storage such as S3 or a database when handling large data, files, or data that needs to persist beyond DAG execution or be shared across DAGs.
Quick Comparison
This table summarizes the main differences between XCom and external storage in Airflow.
| Factor | XCom | External Storage |
|---|---|---|
| Data Size | Small (usually under 48KB) | Large (files, big datasets) |
| Data Type | Simple Python objects (serialized) | Any type (files, blobs, structured data) |
| Persistence | Temporary, tied to DAG run | Long-term, independent of DAG runs |
| Accessibility | Within same DAG run tasks | Across DAGs, external systems, or users |
| Performance Impact | Can slow scheduler if overused | No impact on Airflow metadata DB |
| Use Case | Passing small flags, IDs, or results | Storing logs, large outputs, or shared data |
Key Differences
XCom (short for cross-communication) is designed for passing small pieces of data between tasks in the same DAG run. It stores data in Airflow's metadata database, which means it is limited in size (default max 48KB) and should only hold simple Python objects that can be serialized. Because XCom data is stored in the Airflow database, overusing it with large data can slow down the scheduler and cause performance issues.
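The size constraint above can be checked before pushing. Below is a minimal sketch in plain Python (not part of Airflow's API; the 48KB constant mirrors the limit cited above) that measures a value's JSON-serialized size, which approximates what the metadata database would store:

```python
import json

XCOM_MAX_BYTES = 48 * 1024  # default limit cited above; deployments may differ

def fits_in_xcom(value) -> bool:
    """Return True if the JSON-serialized value stays under the XCom size limit."""
    payload = json.dumps(value).encode("utf-8")
    return len(payload) <= XCOM_MAX_BYTES

# A short string fits; a multi-megabyte blob does not.
print(fits_in_xcom("Hello from XCom"))
print(fits_in_xcom("x" * (1024 * 1024)))
```

A guard like this makes the failure mode explicit: instead of silently bloating the metadata database, a task can raise early or fall back to external storage.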
On the other hand, external storage refers to using outside systems like Amazon S3, Google Cloud Storage, databases, or file servers to store data. This method is suitable for large files, datasets, or any data that needs to persist beyond the life of a DAG run or be shared across multiple DAGs or workflows. External storage does not affect Airflow's internal database performance and can handle any data type or size.
In summary, XCom is best for lightweight, temporary data sharing within a DAG, while external storage is ideal for heavy, persistent, or cross-DAG data needs.
Code Comparison
Here is an example of passing a small value between tasks using XCom in Airflow.
```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def push_function(ti):
    ti.xcom_push(key='sample_key', value='Hello from XCom')

def pull_function(ti):
    message = ti.xcom_pull(key='sample_key', task_ids='push_task')
    print(f"Received message: {message}")

default_args = {'start_date': datetime(2024, 1, 1)}

dag = DAG('xcom_example', default_args=default_args, schedule_interval='@once')

push_task = PythonOperator(
    task_id='push_task',
    python_callable=push_function,
    dag=dag,
)

pull_task = PythonOperator(
    task_id='pull_task',
    python_callable=pull_function,
    dag=dag,
)

push_task >> pull_task
```
External Storage Equivalent
This example shows how to write and read data using external storage (here, the local file system) in Airflow tasks. Note that a local path only works when both tasks run on the same worker; in a distributed setup, use shared or cloud storage instead.
```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def write_to_file():
    with open('/tmp/airflow_data.txt', 'w') as f:
        f.write('Hello from external storage')

def read_from_file():
    with open('/tmp/airflow_data.txt', 'r') as f:
        content = f.read()
    print(f"Read content: {content}")

default_args = {'start_date': datetime(2024, 1, 1)}

dag = DAG('external_storage_example', default_args=default_args, schedule_interval='@once')

write_task = PythonOperator(
    task_id='write_task',
    python_callable=write_to_file,
    dag=dag,
)

read_task = PythonOperator(
    task_id='read_task',
    python_callable=read_from_file,
    dag=dag,
)

write_task >> read_task
```
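One caveat with file-based handoff: if the downstream task can start while the file is still being written, it may read a partial file. A common mitigation, sketched here with only the standard library (the helper name is illustrative, not an Airflow API), is to write to a temporary file and atomically rename it into place:

```python
import os
import tempfile

def write_atomically(path: str, content: str) -> None:
    """Write content to path so readers never observe a half-written file."""
    # Write to a temp file in the same directory: os.replace is only
    # guaranteed atomic within a single filesystem.
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(content)
        os.replace(tmp_path, path)  # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file on failure
        raise

write_atomically("/tmp/airflow_data.txt", "Hello from external storage")
```

Readers then see either the old file or the complete new one, never an intermediate state.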
When to Use Which
Choose XCom when:
- You need to pass small, simple data like flags, IDs, or short strings between tasks in the same DAG run.
- The data size is small enough to avoid performance issues (under 48KB).
- You want quick, temporary communication without external dependencies.
Choose external storage when:
- You need to handle large files, datasets, or complex data types.
- The data must persist beyond the DAG run or be shared across multiple DAGs or workflows.
- You want to avoid slowing down Airflow's metadata database or scheduler.
- You require integration with cloud storage or databases for durability and scalability.
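The two approaches are often combined: store the heavy payload in external storage and pass only its location through XCom. A minimal sketch of that pattern, using plain Python functions to stand in for task bodies (the file path is illustrative):

```python
import json

def produce(data, path: str) -> str:
    """Write the large payload to external storage; return only its location."""
    with open(path, "w") as f:
        json.dump(data, f)
    return path  # in a real DAG, only this small string travels through XCom

def consume(path: str):
    """Resolve the reference received via XCom back into the data."""
    with open(path) as f:
        return json.load(f)

records = [{"id": i} for i in range(1000)]
location = produce(records, "/tmp/large_payload.json")
assert consume(location) == records
```

This keeps the metadata database lean while still letting tasks coordinate through Airflow's native mechanism.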