0
0
Apache Airflowdevops~15 mins

Pushing and pulling XCom values in Apache Airflow - Deep Dive

Choose your learning style9 modes available
Overview - Pushing and pulling XCom values
What is it?
XComs in Airflow are a way for tasks to share small pieces of data between each other during a workflow run. Pushing an XCom means sending a value from one task, and pulling means retrieving that value in another task. This helps tasks communicate and coordinate without external storage. It works behind the scenes in Airflow to pass messages or results.
Why it matters
Without XComs, tasks would be isolated and unable to share results or signals easily, making workflows rigid and complex. XComs solve the problem of task communication inside Airflow, enabling dynamic workflows that adapt based on previous task outputs. This makes automation smarter and more efficient.
Where it fits
Before learning XComs, you should understand basic Airflow concepts like DAGs, tasks, and operators. After mastering XComs, you can explore advanced workflow patterns like branching, dynamic task generation, and cross-DAG communication.
Mental Model
Core Idea
XComs are Airflow's built-in message board where tasks post and read small data pieces to coordinate work.
Think of it like...
Imagine a group project where team members leave sticky notes on a shared board with updates or results. Each member can read these notes to decide what to do next.
┌─────────────┐       ┌─────────────┐
│  Task A     │       │  Task B     │
│ (push XCom) │──────▶│ (pull XCom) │
└─────────────┘       └─────────────┘
       │                     ▲
       │                     │
       └───── Shared XCom ───┘
             Storage
Build-Up - 7 Steps
1
FoundationWhat is an XCom in Airflow
🤔
Concept: Introduce the basic idea of XCom as a data-sharing mechanism between tasks.
In Airflow, XCom stands for 'cross-communication'. It lets tasks send small messages or data to each other during a workflow run. Each XCom has a key, a value, and is linked to the task and DAG run. This is built into Airflow and requires no extra setup.
Result
You understand that XComs are like notes tasks leave for others inside Airflow.
Understanding that Airflow has a built-in way for tasks to share data is key to building dynamic workflows.
2
FoundationHow to push XCom values from a task
🤔
Concept: Learn the basic method to send data from a task using XCom push.
In a PythonOperator, you can push an XCom by returning a value from the task function. Alternatively, you can call `ti.xcom_push(key, value)` inside the task code, where `ti` is the task instance passed automatically. For example: def task_function(ti): ti.xcom_push(key='result', value=42) This sends the number 42 with the key 'result' to the XCom store.
Result
The task stores the value 42 in Airflow's XCom system under the key 'result'.
Knowing how to push data lets you share results or signals for other tasks to use.
3
IntermediatePulling XCom values in downstream tasks
🤔Before reading on: do you think you can pull any XCom value by just knowing the key, or do you need more info? Commit to your answer.
Concept: Learn how to retrieve XCom values in another task using the task instance.
To get an XCom value, use `ti.xcom_pull()` inside the downstream task. You usually specify the task ID that pushed the value and the key. For example: value = ti.xcom_pull(task_ids='task_a', key='result') This fetches the value pushed by 'task_a' with key 'result'. If no key is given, it returns the latest XCom value.
Result
The downstream task receives the value 42 pushed earlier, enabling it to use that data.
Understanding that pulling requires knowing the source task and key prevents confusion and errors.
4
IntermediateUsing return values as implicit XCom pushes
🤔Before reading on: do you think returning a value from a PythonOperator always pushes an XCom? Commit to yes or no.
Concept: Discover that returning a value from a PythonOperator automatically pushes it as an XCom with key 'return_value'.
When a PythonOperator's function returns a value, Airflow automatically pushes it as an XCom with the key 'return_value'. For example: def task_func(): return 'hello' This means you don't need to call `xcom_push` explicitly if you just want to share the return value.
Result
The returned string 'hello' is stored as an XCom and can be pulled by downstream tasks.
Knowing this shortcut simplifies code and reduces boilerplate for common cases.
5
IntermediateXComs with non-Python operators
🤔
Concept: Understand how XComs work with operators that are not Python-based, like BashOperator.
Operators like BashOperator can push XComs by printing a special line to stdout. For example, if the bash script prints `{{ ti.xcom_push(key='result', value='data') }}`, Airflow captures it. Alternatively, you can set `do_xcom_push=True` and the last line output is pushed automatically.
Result
Non-Python tasks can also share data via XComs, enabling cross-language communication.
Knowing how XComs work beyond Python broadens your ability to integrate diverse tasks.
6
AdvancedXCom backend and size limitations
🤔Before reading on: do you think XComs can store any size of data safely? Commit to yes or no.
Concept: Learn about the storage backend of XComs and practical limits on data size.
By default, XComs are stored in Airflow's metadata database as pickled objects. This means large data can bloat the database and slow down Airflow. Best practice is to keep XComs small, like strings or numbers. For large data, use external storage (S3, databases) and pass references via XCom.
Result
You avoid performance issues by not overloading XComs with big data.
Understanding storage limits prevents common scaling and reliability problems.
7
ExpertCustom XCom backends and security considerations
🤔Before reading on: do you think XCom data is encrypted or secure by default? Commit to yes or no.
Concept: Explore how Airflow allows custom XCom backends and the security implications of XCom data.
Airflow 2.3+ supports custom XCom backends, letting you store XComs in places like S3 or encrypted stores. By default, XComs are stored unencrypted in the metadata DB, which can expose sensitive data. Custom backends help secure data and handle large payloads. Implementing a custom backend requires subclassing BaseXCom and configuring Airflow accordingly.
Result
You can tailor XCom storage for security and scalability in production environments.
Knowing how to customize XCom storage is crucial for secure, large-scale workflows.
Under the Hood
XComs are stored as records in Airflow's metadata database linked to DAG runs, task IDs, and keys. When a task pushes an XCom, Airflow serializes the value (usually with pickle) and saves it. Pulling an XCom queries this table and deserializes the value. The task instance (ti) object provides methods to interact with this storage transparently during task execution.
Why designed this way?
Airflow needed a simple, built-in way for tasks to share data without external dependencies. Using the metadata DB leverages existing infrastructure and keeps the system self-contained. Serialization allows storing complex Python objects. Alternatives like external storage were possible but would complicate setup and reduce portability.
┌───────────────┐
│ Task Instance │
│  (ti object)  │
└──────┬────────┘
       │ xcom_push(key, value)
       ▼
┌─────────────────────┐
│ Airflow Metadata DB  │
│  (XCom table store)  │
└──────┬──────────────┘
       │ xcom_pull(task_ids, key)
       ▼
┌───────────────┐
│ Downstream    │
│ Task Reads XCom│
└───────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does returning None from a PythonOperator push an XCom? Commit yes or no.
Common Belief:Returning any value, including None, always pushes an XCom.
Tap to reveal reality
Reality:Returning None does NOT push an XCom; only non-None return values are pushed automatically.
Why it matters:Assuming None pushes an XCom can cause downstream tasks to fail when expecting data.
Quick: Can XComs share large files or big datasets directly? Commit yes or no.
Common Belief:XComs can safely store any size of data, including large files.
Tap to reveal reality
Reality:XComs are limited by database size and performance; large data should be stored externally with XComs passing references.
Why it matters:Storing large data in XComs can crash Airflow or slow it down severely.
Quick: Are XCom values encrypted by default? Commit yes or no.
Common Belief:XCom data is secure and encrypted automatically by Airflow.
Tap to reveal reality
Reality:XComs are stored unencrypted in the metadata database by default.
Why it matters:Sensitive data in XComs can be exposed if the database is compromised.
Quick: Can any task pull XComs from any other DAG run? Commit yes or no.
Common Belief:Tasks can pull XComs from any DAG run or any task at any time.
Tap to reveal reality
Reality:XComs are scoped to a specific DAG run and task instance; cross-run or cross-DAG pulls require special handling.
Why it matters:Misunderstanding scope can cause tasks to pull wrong or no data, breaking workflows.
Expert Zone
1
XComs serialize data with pickle by default, which can cause issues with non-serializable objects or security risks if untrusted data is deserialized.
2
The 'key' parameter in XComs is optional but using explicit keys improves clarity and avoids collisions in complex DAGs.
3
Custom XCom backends can optimize performance and security but require careful implementation to maintain Airflow compatibility.
When NOT to use
Avoid using XComs for large data transfers or sensitive information without encryption. Instead, use external storage systems like S3, databases, or secrets managers, and pass references or IDs via XComs.
Production Patterns
In production, teams use XComs to pass small flags, IDs, or JSON snippets between tasks. They combine XComs with external storage for large files. Custom backends or encryption plugins secure sensitive data. Monitoring XCom size and usage prevents database bloat.
Connections
Message Queues (e.g., RabbitMQ, Kafka)
XComs are a lightweight, built-in message passing system within Airflow, similar in purpose to external message queues but scoped to task communication.
Understanding XComs as a simple message queue helps grasp their role in coordinating asynchronous tasks.
Shared Memory in Operating Systems
XComs act like shared memory segments where processes (tasks) exchange data safely and efficiently within a controlled environment.
Seeing XComs as shared memory clarifies why data size and serialization matter for performance and correctness.
Sticky Notes on a Team Whiteboard
XComs function like sticky notes left on a shared board for team members to read and act upon.
This connection highlights the importance of clear keys and timely reading to avoid confusion or stale data.
Common Pitfalls
#1Trying to push large files directly into XComs.
Wrong approach:ti.xcom_push(key='file', value=open('largefile.zip', 'rb').read())
Correct approach:Upload the file to S3 or another storage, then push the file path or URL via XCom: ti.xcom_push(key='file_path', value='s3://bucket/largefile.zip')
Root cause:Misunderstanding that XComs store data in the metadata DB, which is not designed for large binary blobs.
#2Pulling XCom without specifying the correct task_id or key.
Wrong approach:value = ti.xcom_pull() # no task_ids or key specified
Correct approach:value = ti.xcom_pull(task_ids='task_a', key='result')
Root cause:Assuming XCom pull defaults will find the intended data, leading to None or wrong values.
#3Returning None from a PythonOperator expecting to push an XCom.
Wrong approach:def task_func(): return None # expects this to push an XCom
Correct approach:def task_func(): return 'some_value' # non-None return pushes XCom
Root cause:Not knowing that Airflow only pushes XComs automatically for non-None returns.
Key Takeaways
XComs are Airflow's built-in way for tasks to share small data pieces during a workflow run.
Pushing XComs can be done by returning values or explicitly calling xcom_push in task code.
Pulling XComs requires knowing the source task ID and key to retrieve the correct data.
XComs are stored in the metadata database and should be kept small to avoid performance issues.
For large or sensitive data, use external storage and pass references via XComs, or implement custom backends.