Airflow · Comparison · Beginner · 4 min read

XCom vs External Storage in Airflow: Key Differences and Usage Guide

Use XCom in Airflow for small, simple data sharing between tasks within the same DAG run. Choose external storage like S3 or databases when handling large data, files, or when data needs to persist beyond DAG execution or be shared across DAGs.
⚖️ Quick Comparison

This table summarizes the main differences between XCom and external storage in Airflow.

| Factor | XCom | External Storage |
|---|---|---|
| Data Size | Small (usually under 48KB) | Large (files, big datasets) |
| Data Type | Simple Python objects (serialized) | Any type (files, blobs, structured data) |
| Persistence | Temporary, tied to DAG run | Long-term, independent of DAG runs |
| Accessibility | Within same DAG run tasks | Across DAGs, external systems, or users |
| Performance Impact | Can slow scheduler if overused | No impact on Airflow metadata DB |
| Use Case | Passing small flags, IDs, or results | Storing logs, large outputs, or shared data |
⚖️ Key Differences

XCom (short for cross-communication) is designed for passing small pieces of data between tasks in the same DAG run. It stores data in Airflow's metadata database, which means it is limited in size (the 48KB figure is commonly cited, but the practical cap depends on the metadata database backend) and should only hold simple Python objects that can be serialized. Because XCom data lives in the Airflow database, overusing it with large payloads can slow down the scheduler and cause performance issues.
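
To make the size concern concrete, here is a plain-Python sketch (outside Airflow) of checking a value's serialized size before deciding to push it to XCom. The 48KB threshold and the `fits_in_xcom` helper are illustrative assumptions, not Airflow APIs; the real cap depends on your metadata database backend.

```python
import json

# Illustrative soft limit; the actual XCom cap depends on the
# metadata database backend (assumption, not an Airflow constant)
XCOM_SOFT_LIMIT_BYTES = 48 * 1024

def fits_in_xcom(value) -> bool:
    """Return True if the JSON-serialized value is small enough for XCom."""
    payload = json.dumps(value).encode('utf-8')
    return len(payload) <= XCOM_SOFT_LIMIT_BYTES

print(fits_in_xcom({'run_id': 'abc123', 'row_count': 42}))  # small dict → True
print(fits_in_xcom('x' * 100_000))                          # ~100KB string → False
```

A check like this can gate whether a task pushes the value itself or writes it to external storage and pushes only a reference.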

On the other hand, external storage refers to using outside systems like Amazon S3, Google Cloud Storage, databases, or file servers to store data. This method is suitable for large files, datasets, or any data that needs to persist beyond the life of a DAG run or be shared across multiple DAGs or workflows. External storage does not affect Airflow's internal database performance and can handle any data type or size.

In summary, XCom is best for lightweight, temporary data sharing within a DAG, while external storage is ideal for heavy, persistent, or cross-DAG data needs.

⚖️ Code Comparison

Here is an example of passing a small value between tasks using XCom in Airflow.

```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def push_function(ti):
    # Push a small value into XCom under an explicit key
    ti.xcom_push(key='sample_key', value='Hello from XCom')

def pull_function(ti):
    # Pull the value pushed by push_task
    message = ti.xcom_pull(key='sample_key', task_ids='push_task')
    print(f"Received message: {message}")

default_args = {'start_date': datetime(2024, 1, 1)}

# `schedule` replaces the deprecated `schedule_interval` (Airflow 2.4+)
dag = DAG('xcom_example', default_args=default_args, schedule='@once')

push_task = PythonOperator(
    task_id='push_task',
    python_callable=push_function,
    dag=dag
)

pull_task = PythonOperator(
    task_id='pull_task',
    python_callable=pull_function,
    dag=dag
)

push_task >> pull_task
```
Output:

```
Received message: Hello from XCom
```
↔️ External Storage Equivalent

This example shows how to write and read data using external storage (here, the local file system) in Airflow tasks. Note that a local path like /tmp only works when both tasks run on the same machine; in a distributed deployment, use shared or cloud storage such as S3 or GCS instead.

```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def write_to_file():
    # Write the payload to external storage (a local file here)
    with open('/tmp/airflow_data.txt', 'w') as f:
        f.write('Hello from external storage')

def read_from_file():
    # Read the payload back in a downstream task
    with open('/tmp/airflow_data.txt', 'r') as f:
        content = f.read()
    print(f"Read content: {content}")

default_args = {'start_date': datetime(2024, 1, 1)}

# `schedule` replaces the deprecated `schedule_interval` (Airflow 2.4+)
dag = DAG('external_storage_example', default_args=default_args, schedule='@once')

write_task = PythonOperator(
    task_id='write_task',
    python_callable=write_to_file,
    dag=dag
)

read_task = PythonOperator(
    task_id='read_task',
    python_callable=read_from_file,
    dag=dag
)

write_task >> read_task
```
Output:

```
Read content: Hello from external storage
```
🎯 When to Use Which

Choose XCom when:

  • You need to pass small, simple data like flags, IDs, or short strings between tasks in the same DAG run.
  • The data size is small enough to avoid performance issues (under 48KB).
  • You want quick, temporary communication without external dependencies.

Choose external storage when:

  • You need to handle large files, datasets, or complex data types.
  • The data must persist beyond the DAG run or be shared across multiple DAGs or workflows.
  • You want to avoid slowing down Airflow's metadata database or scheduler.
  • You require integration with cloud storage or databases for durability and scalability.
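
A common way to combine the two lists above is to store the heavy payload in external storage and pass only a lightweight reference (a path or object key) through XCom. The sketch below simulates this pattern in plain Python: a dict stands in for Airflow's XCom table, a temp file stands in for external storage, and the `produce`/`consume` names are illustrative, not Airflow APIs.

```python
import tempfile
from pathlib import Path

xcom_store = {}  # stand-in for Airflow's XCom table

def produce(data: str) -> None:
    # Write the large payload to "external storage" (a temp file here)
    path = Path(tempfile.gettempdir()) / 'large_output.txt'
    path.write_text(data)
    # Push only the small reference through "XCom"
    xcom_store['output_path'] = str(path)

def consume() -> str:
    # Pull the reference, then fetch the payload from external storage
    path = Path(xcom_store['output_path'])
    return path.read_text()

produce('a' * 1_000_000)               # ~1MB payload stays out of "XCom"
print(len(xcom_store['output_path']))  # only a short path was shared
print(len(consume()))                  # full payload retrieved downstream
```

In real Airflow, the downstream task would `xcom_pull` the path or S3 key and read the object itself, keeping the metadata database free of large payloads.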

Key Takeaways

  • Use XCom for small, temporary data sharing within the same DAG run.
  • Use external storage for large, persistent, or cross-DAG data needs.
  • XCom data is limited in size and stored in Airflow's metadata database.
  • External storage avoids performance issues and supports any data type or size.
  • Choose based on data size, persistence needs, and sharing scope.