Streaming joins in Apache Spark - Time & Space Complexity
When working with streaming data, joining two streams efficiently is important.
We want to know how the time to join grows as the streams get bigger.
Analyze the time complexity of the following code snippet.
```scala
import org.apache.spark.sql.functions.expr

val stream1 = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
val stream2 = spark.readStream.format("socket").option("host", "localhost").option("port", 9998).load()

// The socket source yields a single string column named `value`; alias each
// stream so the join expression can reference the two sides unambiguously.
val joinedStream = stream1.as("s1").join(stream2.as("s2"), expr("s1.value = s2.value"))

val query = joinedStream.writeStream.format("console").start()
query.awaitTermination()
```
This code joins two streaming data sources on a key and outputs the joined results.
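To see where the cost comes from, here is a deliberately naive model of a stream-stream join in plain Scala (an illustrative sketch, not Spark's actual implementation): each side buffers every row it has seen so far, and on every new micro-batch each buffered left row is compared against each buffered right row.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical model of a stateful stream-stream join (not Spark internals).
final case class Row(key: String, value: Int)

var leftState  = Vector.empty[Row]   // all left rows seen so far
var rightState = Vector.empty[Row]   // all right rows seen so far
var comparisons = 0L                 // total key comparisons performed

def onBatch(leftBatch: Seq[Row], rightBatch: Seq[Row]): Seq[(Row, Row)] = {
  leftState  = leftState ++ leftBatch
  rightState = rightState ++ rightBatch
  val matches = ArrayBuffer.empty[(Row, Row)]
  for (l <- leftState; r <- rightState) {  // n * m pairs every batch
    comparisons += 1
    if (l.key == r.key) matches += ((l, r))
  }
  matches.toSeq
}

// One micro-batch: 2 left rows x 3 right rows = 6 key comparisons, 2 matches.
val firstBatch = onBatch(
  Seq(Row("a", 1), Row("b", 2)),
  Seq(Row("a", 10), Row("b", 20), Row("c", 30)))
```

Because the state vectors only ever grow, the `n * m` pair loop gets more expensive with every batch, which is exactly the scaling behavior analyzed below.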
Identify the operations that repeat (loops, recursion, traversals):
- Primary operation: matching keys from both streams continuously as new data arrives.
- How many times: for each new micro-batch, the join compares keys across the rows buffered from both streams.
As more data arrives in each stream, the join operation must compare more keys.
| Rows per stream (n = m) | Approx. key comparisons |
|---|---|
| 10 | ~100 |
| 100 | ~10,000 |
| 1,000 | ~1,000,000 |
Pattern observation: The number of comparisons grows roughly with the product of the sizes of the two streams.
Time Complexity: O(n * m), where n and m are the numbers of rows accumulated in the two streams.
This means the time to join grows roughly with the product of the number of items in the two streams.
[X] Wrong: "Joining two streams is always fast because data comes in small pieces."
[OK] Correct: Even small batches can add up, and the join compares keys across both streams, so time can grow quickly as data accumulates.
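The pairwise-comparison cost is not inevitable, though: indexing one side by its join key turns each probe into a hash lookup, so total work is roughly proportional to n + m rather than n * m. This is the idea behind hash joins, and Spark's stream-stream join likewise keys its buffered state by the join key. A plain-Scala sketch with made-up data (an illustration, not Spark's code):

```scala
// Sketch of hash-based matching: build an index over one side, probe with
// the other. Build is O(n), each probe is an expected O(1) lookup.
def hashJoin(left: Seq[(String, Int)], right: Seq[(String, Int)]): Seq[(String, Int, Int)] = {
  val index = left.groupBy(_._1)                    // build: key -> left rows
  right.flatMap { case (k, rv) =>                   // probe: one lookup per right row
    index.getOrElse(k, Nil).map { case (_, lv) => (k, lv, rv) }
  }
}

val out = hashJoin(Seq("a" -> 1, "b" -> 2), Seq("a" -> 10, "c" -> 30))
// only the shared key "a" produces a joined row
```

The quadratic bound above remains the worst case, for example when many rows share the same key and every probe matches a large bucket.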
Understanding how streaming joins scale helps you explain real-time data processing challenges clearly and confidently.
What if we added a time window to limit how far back the join looks? How would the time complexity change?
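One way to reason about that question is a small plain-Scala model (hypothetical field names, not Spark's windowing API): if rows older than the window are evicted before matching, the buffered state on each side is bounded by the rows that arrive within one window, so the per-batch cost drops from O(n * m) over the full history to roughly the product of the two window sizes, which stays constant as the streams run.

```scala
// Hypothetical model: join only events that fall within `windowMs`
// of the newest event, evicting everything older.
final case class Event(key: String, ts: Long)

def windowedJoin(left: Seq[Event], right: Seq[Event], windowMs: Long): Seq[(Event, Event)] = {
  val latest = (left ++ right).map(_.ts).max
  val l = left.filter(e => latest - e.ts <= windowMs)   // evict old left rows
  val r = right.filter(e => latest - e.ts <= windowMs)  // evict old right rows
  for {
    a <- l
    b <- r
    if a.key == b.key && math.abs(a.ts - b.ts) <= windowMs
  } yield (a, b)
}
```

With a 50 ms window, an event at ts = 0 is evicted once an event at ts = 100 arrives, so it can never pair with later events; without the window, every old event stays eligible forever. This is essentially what a watermark on an event-time column buys a real streaming join: bounded state, and therefore bounded per-batch work.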