Watermarking for late data in Apache Spark - Time & Space Complexity
We want to understand how the time to process streaming data changes when we use watermarking for late data.
Specifically, how does handling late data affect the work Spark does as data grows?
Analyze the time complexity of the following Apache Spark Structured Streaming code snippet.
```scala
streamingDF
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "5 minutes"), col("category"))
  .count()
```
This code groups streaming data by time windows and category, using a watermark to handle late data up to 10 minutes late.
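To make the per-batch work concrete, here is a minimal plain-Python model of what the query above does on each micro-batch. This is an illustrative sketch, not Spark's actual implementation (the function name and constants are hypothetical, and real Spark manages watermarks and state across executors): rows behind the watermark are dropped, surviving rows update a per-(window, category) count, and state for windows entirely behind the watermark is evicted.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60        # 5-minute tumbling windows
WATERMARK_SECONDS = 10 * 60    # accept data up to 10 minutes late

def process_batch(rows, state, max_event_time_seen):
    """Simplified model of one micro-batch: O(n) in the rows received.

    rows: list of (event_time_seconds, category) tuples.
    state: dict mapping (window_start, category) -> count.
    Returns the updated max event time observed so far.
    """
    # Spark derives the watermark from the max event time of *previous* batches.
    watermark = max_event_time_seen - WATERMARK_SECONDS
    for event_time, category in rows:
        if event_time >= watermark:  # late rows behind the watermark are dropped
            window_start = (event_time // WINDOW_SECONDS) * WINDOW_SECONDS
            state[(window_start, category)] += 1
        max_event_time_seen = max(max_event_time_seen, event_time)
    # Evict state for windows that can no longer receive late data.
    for key in [k for k in state if k[0] + WINDOW_SECONDS <= watermark]:
        del state[key]
    return max_event_time_seen
```

Note the loop body touches every incoming row exactly once, which is where the linear per-batch cost comes from.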
Look at what repeats as new data arrives:
- Primary operation: Grouping data by time windows and categories repeatedly for each batch.
- How many times: Once per micro-batch, processing all newly arrived data plus any late data that falls within the watermark.
As more data arrives, Spark processes more rows each batch, including late data within the watermark.
| Rows in Batch (n) | Approx. Operations per Batch |
|---|---|
| 10 | ~10 |
| 100 | ~100 |
| 1000 | ~1000 |
Pattern observation: The work grows roughly linearly with the number of rows processed each batch, including late data within the watermark.
Time Complexity: O(n)
This means the processing time grows roughly in direct proportion to the number of rows Spark processes each batch.
[X] Wrong: "Watermarking makes processing time constant no matter how much data arrives."
[OK] Correct: Watermarking bounds how long aggregation state is kept and how late a row can arrive and still be counted, but Spark still processes every row that clears the watermark, so processing time grows with data size.
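The distinction shows up clearly in a small self-contained simulation (again a hypothetical plain-Python model, not Spark itself): feed many fixed-size micro-batches through a windowed count with watermark-based eviction, and the work per batch tracks the rows received, while the retained state stays bounded by the watermark rather than growing with total input.

```python
from collections import defaultdict

WINDOW, WATERMARK = 300, 600  # 5-min windows, 10-min watermark (seconds)

state = defaultdict(int)      # (window_start, category) -> count
max_seen = 0
rows_processed = []
state_sizes = []

for batch in range(100):
    # Each batch delivers 50 in-order rows, one second apart.
    rows = [(batch * 50 + i, "clicks") for i in range(50)]
    watermark = max_seen - WATERMARK
    for t, cat in rows:
        if t >= watermark:
            state[((t // WINDOW) * WINDOW, cat)] += 1
        max_seen = max(max_seen, t)
    # Evict windows that are entirely behind the watermark.
    for key in [k for k in state if k[0] + WINDOW <= watermark]:
        del state[key]
    rows_processed.append(len(rows))
    state_sizes.append(len(state))

# Work per batch tracks the rows received: O(n) per batch...
assert all(r == 50 for r in rows_processed)
# ...while retained state stays small (a handful of windows), even though
# 5000 seconds of data would span 17 windows without eviction.
assert max(state_sizes) <= 4
```

Processing time scales with the rows in each batch; what watermarking bounds is the *state* Spark must keep around to absorb late arrivals.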
Understanding how watermarking affects processing time helps you explain how streaming systems handle late data efficiently while scaling with input size.
What if we increased the watermark delay from 10 minutes to 1 hour? How would the time complexity change?