Watermarking for late data in Apache Spark - Time & Space Complexity
We want to understand how the time to process streaming data changes when we use watermarking for late data.
Specifically, how does handling late data affect the work Spark does as data grows?
Analyze the time complexity of the following Apache Spark Structured Streaming code snippet.
```scala
streamingDF
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "5 minutes"), col("category"))
  .count()
```
This code groups streaming data by time windows and category, using a watermark to handle late data up to 10 minutes late.
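To make the per-batch work concrete, here is a minimal plain-Python model of what the query above does on each micro-batch. This is an illustrative sketch, not Spark's actual implementation (the function name and constants are hypothetical, and real Spark manages watermarks and state across executors): rows behind the watermark are dropped, surviving rows update a per-(window, category) count, and state for windows entirely behind the watermark is evicted.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60        # 5-minute tumbling windows
WATERMARK_SECONDS = 10 * 60    # accept data up to 10 minutes late

def process_batch(rows, state, max_event_time_seen):
    """Simplified model of one micro-batch: O(n) in the rows received.

    rows: list of (event_time_seconds, category) tuples.
    state: dict mapping (window_start, category) -> count.
    Returns the updated max event time observed so far.
    """
    # Spark derives the watermark from the max event time of *previous* batches.
    watermark = max_event_time_seen - WATERMARK_SECONDS
    for event_time, category in rows:
        if event_time >= watermark:  # late rows behind the watermark are dropped
            window_start = (event_time // WINDOW_SECONDS) * WINDOW_SECONDS
            state[(window_start, category)] += 1
        max_event_time_seen = max(max_event_time_seen, event_time)
    # Evict state for windows that can no longer receive late data.
    for key in [k for k in state if k[0] + WINDOW_SECONDS <= watermark]:
        del state[key]
    return max_event_time_seen
```

Note the loop body touches every incoming row exactly once, which is where the linear per-batch cost comes from.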
Look at what repeats as new data arrives:
- Primary operation: Grouping data by time windows and categories repeatedly for each batch.
- How many times: Once per micro-batch, processing all newly arrived data plus any late data that falls within the watermark.
As more data arrives, Spark processes more rows each batch, including late data within the watermark.
| Rows in Batch (n) | Approx. Operations per Batch |
|---|---|
| 10 | ~10 |
| 100 | ~100 |
| 1000 | ~1000 |
Pattern observation: The work grows roughly linearly with the number of rows processed each batch, including late data within the watermark.
Time Complexity: O(n)
This means the processing time grows roughly in direct proportion to the number of rows Spark processes each batch.
[X] Wrong: "Watermarking makes processing time constant no matter how much data arrives."
[OK] Correct: Watermarking bounds how long aggregation state is kept and how late a row can arrive and still be counted, but Spark still processes every row that clears the watermark, so processing time grows with data size.
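The distinction shows up clearly in a small self-contained simulation (again a hypothetical plain-Python model, not Spark itself): feed many fixed-size micro-batches through a windowed count with watermark-based eviction, and the work per batch tracks the rows received, while the retained state stays bounded by the watermark rather than growing with total input.

```python
from collections import defaultdict

WINDOW, WATERMARK = 300, 600  # 5-min windows, 10-min watermark (seconds)

state = defaultdict(int)      # (window_start, category) -> count
max_seen = 0
rows_processed = []
state_sizes = []

for batch in range(100):
    # Each batch delivers 50 in-order rows, one second apart.
    rows = [(batch * 50 + i, "clicks") for i in range(50)]
    watermark = max_seen - WATERMARK
    for t, cat in rows:
        if t >= watermark:
            state[((t // WINDOW) * WINDOW, cat)] += 1
        max_seen = max(max_seen, t)
    # Evict windows that are entirely behind the watermark.
    for key in [k for k in state if k[0] + WINDOW <= watermark]:
        del state[key]
    rows_processed.append(len(rows))
    state_sizes.append(len(state))

# Work per batch tracks the rows received: O(n) per batch...
assert all(r == 50 for r in rows_processed)
# ...while retained state stays small (a handful of windows), even though
# 5000 seconds of data would span 17 windows without eviction.
assert max(state_sizes) <= 4
```

Processing time scales with the rows in each batch; what watermarking bounds is the *state* Spark must keep around to absorb late arrivals.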
Understanding how watermarking affects processing time helps you explain how streaming systems handle late data efficiently while scaling with input size.
What if we increased the watermark delay from 10 minutes to 1 hour? How would the time complexity change?