
Output modes (append, complete, update) in Apache Spark - Deep Dive

Overview - Output modes (append, complete, update)
What is it?
Output modes in Apache Spark define how the results of streaming computations are written to the output sink. There are three main modes: append, complete, and update. Each mode controls whether new rows, all rows, or only changed rows are written after each trigger. This helps manage how data is saved or displayed during continuous streaming.
Why it matters
Without output modes, streaming systems would struggle to efficiently and correctly update results as new data arrives. Output modes solve the problem of how to handle changing data in real time, ensuring that outputs reflect the latest state without unnecessary duplication or loss. This is crucial for real-time dashboards, alerts, and data pipelines that rely on accurate, timely information.
Where it fits
Learners should first understand basic Spark Structured Streaming concepts like streams, triggers, and sinks. After mastering output modes, they can explore advanced topics like stateful aggregations, watermarking, and fault tolerance in streaming.
Mental Model
Core Idea
Output modes control how streaming results are written by deciding whether to add only new data, rewrite all data, or update changed data after each batch.
Think of it like...
Imagine a live scoreboard at a sports game: append mode is like adding new scores as they happen, complete mode is like rewriting the entire scoreboard every time, and update mode is like changing only the scores that have changed since the last update.
┌───────────────┐
│ Streaming Job │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ Output Modes  │
├───────────────┤
│ Append        │
│ Complete      │
│ Update        │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ Output Sink   │
└───────────────┘
Build-Up - 8 Steps
1
FoundationUnderstanding Streaming Outputs
🤔
Concept: Streaming outputs are the results generated continuously as data flows through Spark Structured Streaming.
In Spark Structured Streaming, data arrives in small chunks called micro-batches. After processing each batch, Spark writes the results to an output destination like a file, database, or console. How these results are written depends on the output mode chosen.
Result
Learners understand that streaming produces ongoing results that need to be saved or displayed.
Knowing that streaming outputs are continuous helps grasp why special modes are needed to handle changing data efficiently.
2
FoundationWhat Are Output Modes?
🤔
Concept: Output modes define how Spark writes streaming results after each batch.
There are three output modes:
- Append: only new rows since the last batch are written.
- Complete: all rows are rewritten every time.
- Update: only rows that changed since the last batch are written.
Each mode suits different use cases depending on the type of query and sink.
Result
Learners can name and describe the three output modes.
Understanding output modes is key to controlling how streaming data is saved or displayed.
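To make the three behaviors concrete, here is a minimal plain-Python sketch (not Spark code) of what each mode would send to the sink after every micro-batch, using a running word count as the query:

```python
# Plain-Python sketch of output-mode semantics for a running word count.
# "counts" plays the role of Spark's internal aggregation state.

counts = {}        # internal aggregation state, kept across triggers
sink_complete = [] # what a complete-mode sink receives per trigger
sink_update = []   # what an update-mode sink receives per trigger

for batch in [["a", "b"], ["b", "c"]]:
    changed = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
        changed[word] = counts[word]
    sink_complete.append(dict(counts))  # complete: rewrite the whole result table
    sink_update.append(changed)         # update: only rows changed this trigger
    # append is omitted here: for an aggregating query, rows can only be
    # appended once they are final, which requires a watermark

# sink_complete grows to the full table each trigger;
# sink_update carries only the delta: [{'a': 1, 'b': 1}, {'b': 2, 'c': 1}]
```

The second trigger makes the difference visible: complete mode re-emits the unchanged count for "a", while update mode emits only "b" and "c".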
3
IntermediateAppend Mode in Detail
🤔Before reading on: do you think append mode can update existing rows or only add new ones? Commit to your answer.
Concept: Append mode writes only new rows that were added in the latest batch, never changing previous output.
In append mode, Spark outputs only the new data that arrived since the last trigger. This mode works well for queries that only add new rows, like simple selects or filters without aggregations. It cannot update or remove existing rows.
Result
Output contains only new rows appended after each batch.
Knowing append mode only adds new data prevents errors when using it with queries that require updates.
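A quick sketch of append semantics on a non-aggregating filter query (plain Python, not Spark code): each trigger emits only the rows that arrived in that micro-batch, and earlier output is never revisited.

```python
# Append mode on a filter query: each trigger emits only the rows that
# passed the filter in that micro-batch; nothing is rewritten or removed.

def run_append(batches, predicate):
    emitted = []
    for batch in batches:
        emitted.append([row for row in batch if predicate(row)])
    return emitted

out = run_append([[1, 5, 12], [7, 20]], lambda x: x > 6)
# trigger 1 emits [12], trigger 2 emits [7, 20]; earlier output is untouched
```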
4
IntermediateComplete Mode Explained
🤔Before reading on: do you think complete mode rewrites all data or just changes? Commit to your answer.
Concept: Complete mode rewrites the entire result table after every batch, showing the full current state.
Complete mode outputs the full result of the query every time, replacing previous output. This is useful for aggregation queries where the entire result changes over time, like counts or sums. It requires sinks that can handle full overwrites.
Result
Output shows the entire updated result after each batch.
Understanding complete mode helps handle queries where the full state must be visible at all times.
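The cost implication of complete mode can be sketched in a few lines (plain Python, not Spark code): the number of rows written per trigger tracks the size of the full result table, not the size of the incoming batch.

```python
# Complete mode rewrites the full result table each trigger, so output
# volume grows with the number of distinct keys seen so far.

counts = {}
rows_written = []  # rows sent to the sink at each trigger

for batch in [["a"], ["b"], ["c"]]:
    for key in batch:
        counts[key] = counts.get(key, 0) + 1
    rows_written.append(len(counts))  # full rewrite every trigger

# rows_written == [1, 2, 3] even though each batch held a single row
```

This is why complete mode becomes expensive for high-cardinality aggregations: the sink rewrite grows even when each batch is tiny.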
5
IntermediateUpdate Mode Explained
🤔Before reading on: do you think update mode outputs all rows or only changed rows? Commit to your answer.
Concept: Update mode outputs only rows that have changed since the last batch, including new and updated rows.
Update mode writes only the rows that changed since the last trigger. It supports queries with aggregations and updates but does not rewrite the entire result. This mode requires sinks that can update existing rows, like databases.
Result
Output contains only changed rows, making updates efficient.
Knowing update mode balances efficiency and accuracy for stateful streaming queries.
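Update mode pairs naturally with sinks that apply changes as upserts. A plain-Python sketch (not Spark code) of the round trip, where the sink merges only the changed rows into its current table:

```python
# Update mode: each trigger emits only rows whose aggregate value changed;
# the sink applies them as upserts into its current table.

counts, sink_table = {}, {}

for batch in [["a", "a"], ["b"]]:
    changed = {}
    for key in batch:
        counts[key] = counts.get(key, 0) + 1
        changed[key] = counts[key]
    sink_table.update(changed)  # upsert only the changed rows

# after trigger 1 the sink holds {'a': 2}; after trigger 2, {'a': 2, 'b': 1}
```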
6
AdvancedChoosing Output Modes by Query Type
🤔Before reading on: which output mode would you pick for a streaming count aggregation? Commit to your answer.
Concept: Different query types require specific output modes to work correctly and efficiently.
Simple append-only queries (selects, filters, projections) use append mode. Aggregations whose full result must be visible use complete mode; aggregations where only the incremental changes matter use update mode. Append mode works with aggregations only when a watermark lets Spark finalize results, so choosing the wrong mode either fails at query start or produces incorrect output.
Result
Learners can match query types to appropriate output modes.
Understanding this prevents common runtime errors and ensures correct streaming results.
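The matching rules above can be encoded in a small helper. This is a hypothetical function of ours, not a Spark API, and it simplifies the documented rules (it ignores joins and arbitrary stateful operations like mapGroupsWithState):

```python
# A rough encoding of which output modes a streaming query supports,
# simplified from the Spark Structured Streaming guide's compatibility rules.

def allowed_modes(has_aggregation: bool, has_watermark: bool) -> set:
    if not has_aggregation:
        return {"append", "update"}  # complete is not supported here
    modes = {"complete", "update"}   # aggregations can rewrite or upsert
    if has_watermark:
        modes.add("append")          # finalized windows can be appended
    return modes

# a plain filter query cannot use complete mode
assert "complete" not in allowed_modes(False, False)
# an aggregation without a watermark cannot use append mode
assert "append" not in allowed_modes(True, False)
```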
7
AdvancedOutput Modes and Sink Compatibility
🤔Before reading on: do all sinks support all output modes? Commit to your answer.
Concept: Not all output modes work with every sink; compatibility depends on sink capabilities.
File sinks support only append mode, because files cannot be rewritten or updated in place. The console sink supports all three modes, which makes it convenient for debugging. Database and key-value sinks (often reached via foreachBatch or a custom writer) can support update mode by applying changed rows as upserts. Choosing the right sink and mode combination is critical for production.
Result
Learners understand how sink choice limits output mode options.
Knowing sink compatibility avoids deployment failures and data inconsistencies.
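A simplified compatibility table makes the constraint checkable up front. The entries below follow the Spark Structured Streaming guide for the built-in sinks, but they are a sketch: verify against your Spark version, since foreachBatch and custom sinks behave differently.

```python
# Simplified sink/mode compatibility (built-in sinks only; check your
# Spark version before relying on this).

SINK_MODES = {
    "file":    {"append"},                        # files cannot be updated in place
    "kafka":   {"append", "update", "complete"},
    "console": {"append", "update", "complete"},  # debugging only
    "memory":  {"append", "complete"},
}

def check_compatibility(sink: str, mode: str) -> None:
    if mode not in SINK_MODES.get(sink, set()):
        raise ValueError(f"sink '{sink}' does not support '{mode}' mode")

check_compatibility("file", "append")    # fine
# check_compatibility("file", "update")  # would raise ValueError
```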
8
ExpertInternal State Management and Output Modes
🤔Before reading on: does Spark keep all past data in memory for output modes? Commit to your answer.
Concept: Spark manages internal state differently depending on output mode to track changes efficiently.
In complete mode, Spark maintains the full aggregation state and rewrites all results. In update mode, it tracks only changed rows to output updates. Append mode requires minimal state since it only adds new rows. This internal state management affects memory use and performance. Understanding this helps optimize streaming jobs.
Result
Learners grasp how output modes impact resource use and latency.
Knowing internal state behavior guides tuning and troubleshooting of streaming applications.
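The state-footprint difference can be sketched directly (plain Python, not Spark internals): aggregating queries in complete or update mode must retain the full aggregation state across triggers, while a non-aggregating append query retains nothing between batches.

```python
# Relative state kept across triggers. "agg_state" stands in for the state
# store of a complete/update aggregation query; an append-only filter query
# emits its rows and forgets them.

agg_state = {}         # retained and grown across triggers
append_retained = []   # rows a plain append query keeps after each trigger

for batch in [["a", "b"], ["b"], ["c"]]:
    for key in batch:
        agg_state[key] = agg_state.get(key, 0) + 1
    append_retained.append(0)  # emitted downstream, then forgotten

# agg_state grows with distinct keys (3 here); append memory use stays flat
```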
Under the Hood
Spark Structured Streaming processes data in micro-batches. After each batch, it computes the query result and writes output based on the mode. Append mode emits only the rows produced by the current batch (for windowed aggregations, only rows the watermark has finalized). Complete mode outputs the entire aggregated result by maintaining full state. Update mode tracks changes in state and outputs only updated rows. Internally, Spark uses state stores and checkpoints to manage this state and ensure fault tolerance.
Why designed this way?
These modes were designed to balance correctness, efficiency, and sink capabilities. Append mode is simple and efficient for append-only data. Complete mode ensures correctness for full aggregations but can be costly. Update mode offers a middle ground for incremental updates. Alternatives like rewriting entire datasets every time would be inefficient, and no output modes would cause inconsistent or incorrect streaming results.
┌───────────────┐
│ Micro-batch   │
│ Processing    │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ State Store   │<─────────────┐
│ (Aggregations)│              │
└───────┬───────┘              │
        │                      │
        ▼                      │
┌───────────────┐              │
│ Output Mode   │──────────────┤
│ Logic         │              │
└───────┬───────┘              │
        │                      │
        ▼                      │
┌───────────────┐              │
│ Output Sink   │              │
└───────────────┘              │
                               │
        ┌──────────────────────┘
        ▼
┌───────────────┐
│ Checkpointing │
│ & Fault Toler.│
└───────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does append mode allow updating existing rows? Commit yes or no.
Common Belief:Append mode can update existing rows in the output.
Reality:Append mode only adds new rows and never modifies or deletes existing output rows.
Why it matters:Using append mode with queries that update rows causes incorrect or incomplete results.
Quick: Does complete mode output only changed rows? Commit yes or no.
Common Belief:Complete mode outputs only the rows that changed since the last batch.
Reality:Complete mode rewrites the entire result table every time, not just changes.
Why it matters:Misunderstanding this can lead to performance issues due to unnecessary full rewrites.
Quick: Can all sinks support update mode? Commit yes or no.
Common Belief:All output sinks support update mode for streaming.
Reality:Many sinks, like file systems, do not support update mode because they cannot modify existing data easily.
Why it matters:Choosing incompatible sinks causes runtime errors or data loss.
Quick: Does Spark keep all past data in memory for append mode? Commit yes or no.
Common Belief:Spark stores all past data in memory regardless of output mode.
Reality:Append mode requires minimal state and does not keep all past data in memory.
Why it matters:Overestimating memory needs can lead to inefficient resource allocation.
Expert Zone
1
Update mode requires sinks that support idempotent writes or upserts to avoid data duplication or corruption.
2
Complete mode can cause high latency and resource use for large stateful aggregations, so it is often avoided in production.
3
For windowed aggregations, append mode requires an event-time watermark: Spark can only append a window's result once the watermark guarantees no more late data will change it.
When NOT to use
Avoid append mode for queries with aggregations that update existing rows; use update or complete instead. Avoid complete mode for large stateful queries due to performance costs; consider update mode or incremental processing. If the sink does not support updates, do not use update mode; instead, use append or complete with compatible sinks.
Production Patterns
In production, append mode is common for simple event logging pipelines. Update mode is used with sinks that support upserts, such as Cassandra or JDBC databases, often wired up through foreachBatch. Complete mode is used for small or medium-sized aggregations powering dashboards where full state visibility is needed. Combining output modes with watermarking and state cleanup is a common pattern to manage resource use.
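The upsert pattern behind update mode can be sketched without Spark (plain Python; `database` here stands in for an external table keyed by primary key, the way a Cassandra or JDBC sink would apply changed rows):

```python
# Upsert pattern used by update-mode sinks: each trigger's changed rows are
# merged into a table keyed by primary key. Insert-or-overwrite is
# idempotent, so a retried batch does not corrupt the table.

database = {}  # stands in for an external table keyed by id

def upsert(rows):
    for key, value in rows:
        database[key] = value  # insert or overwrite in place

upsert([("u1", 5), ("u2", 3)])  # trigger 1
upsert([("u1", 7)])             # trigger 2: u1 updated in place
upsert([("u1", 7)])             # retry of trigger 2: no change, safe
# database == {'u1': 7, 'u2': 3}
```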
Connections
Event-time Watermarking
Builds-on
Understanding output modes helps grasp how watermarking controls late data handling and state eviction in streaming.
Database Upserts
Same pattern
Update mode in streaming is similar to database upserts, where only changed rows are updated, improving efficiency.
Real-time Dashboarding
Application domain
Choosing the right output mode directly impacts how real-time dashboards reflect live data accurately and efficiently.
Common Pitfalls
#1Using append mode with aggregation queries that update existing rows.
Wrong approach:streamingDF.writeStream.outputMode("append").format("console").start()
Correct approach:streamingDF.writeStream.outputMode("update").format("console").start()
Root cause:Not realizing that append mode cannot handle updated rows leads to incorrect results or failed streaming jobs.
#2Selecting update mode with a file sink that does not support updates.
Wrong approach:streamingDF.writeStream.outputMode("update").format("parquet").start()
Correct approach:streamingDF.writeStream.outputMode("append").format("parquet").start()
Root cause:Not knowing sink limitations leads to runtime errors or data corruption.
#3Using complete mode for very large aggregations causing high latency.
Wrong approach:streamingDF.writeStream.outputMode("complete").format("console").start()
Correct approach:streamingDF.writeStream.outputMode("update").format("console").start()
Root cause:Ignoring performance impact of rewriting full results causes slow streaming.
Key Takeaways
Output modes in Spark Structured Streaming control how results are written after each batch: append adds new rows, complete rewrites all rows, and update writes only changed rows.
Choosing the correct output mode depends on the query type and sink capabilities to ensure correct and efficient streaming.
Append mode is simple and efficient for new data only, complete mode suits full aggregations but can be costly, and update mode balances efficiency with incremental updates.
Understanding sink compatibility with output modes prevents runtime errors and data inconsistencies.
Knowing internal state management behind output modes helps optimize resource use and streaming performance.