0
0
Apache Sparkdata~5 mins

Streaming joins in Apache Spark - Time & Space Complexity

Choose your learning style9 modes available
Time Complexity: Streaming joins
O(n * m)
Understanding Time Complexity

When working with streaming data, joining two streams efficiently is important.

We want to know how the time to join grows as the streams get bigger.

Scenario Under Consideration

Analyze the time complexity of the following code snippet.

val stream1 = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
val stream2 = spark.readStream.format("socket").option("host", "localhost").option("port", 9998).load()

val joinedStream = stream1.join(stream2, expr("stream1.key = stream2.key"))

val query = joinedStream.writeStream.format("console").start()
query.awaitTermination()

This code joins two streaming data sources on a key and outputs the joined results.

Identify Repeating Operations

Identify the loops, recursion, array traversals that repeat.

  • Primary operation: Matching keys from both streams continuously as new data arrives.
  • How many times: For each new batch of data, the join operation compares keys between the two streams.
How Execution Grows With Input

As more data arrives in each stream, the join operation must compare more keys.

Input Size (n)Approx. Operations
10About 100 key comparisons
100About 10,000 key comparisons
1000About 1,000,000 key comparisons

Pattern observation: The number of comparisons grows roughly with the product of the sizes of the two streams.

Final Time Complexity

Time Complexity: O(n * m)

This means the time to join grows roughly with the number of items in both streams multiplied together.

Common Mistake

[X] Wrong: "Joining two streams is always fast because data comes in small pieces."

[OK] Correct: Even small batches can add up, and the join compares keys across both streams, so time can grow quickly as data accumulates.

Interview Connect

Understanding how streaming joins scale helps you explain real-time data processing challenges clearly and confidently.

Self-Check

What if we added a time window to limit how far back the join looks? How would the time complexity change?