Structured Streaming basics in Apache Spark - Time & Space Complexity
We want to understand how the time it takes to process data grows as the amount of streaming data increases.
How does the processing time change when more data arrives in a stream?
Analyze the time complexity of the following Structured Streaming code snippet.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming_example").getOrCreate()

# Read a stream of text lines from a local socket source.
streamingDF = spark.readStream.format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Write each micro-batch to the console as it arrives.
query = streamingDF.writeStream.format("console").start()
query.awaitTermination()
```
This code reads streaming data from a socket and writes it to the console as it arrives.
- Primary operation: Processing each batch of streaming data as it arrives.
- How many times: Once per micro-batch, repeating continuously as new data comes in.
Each micro-batch processes all new data received since the last batch.
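To see why per-batch work tracks batch size, here is a minimal sketch in plain Python (no Spark required). The micro-batch loop is a stand-in for the streaming engine, and the record counts are illustrative, not measured:

```python
# Toy micro-batch loop: each batch touches only the records that
# arrived since the last batch, so per-batch work is O(batch size).
def process_micro_batch(records):
    operations = 0
    for record in records:      # one pass over the new records
        _ = record.upper()      # stand-in for per-record work
        operations += 1
    return operations

# Three batches of growing size, mirroring the table below.
batches = [["a"] * 10, ["b"] * 100, ["c"] * 1000]
per_batch_ops = [process_micro_batch(b) for b in batches]
print(per_batch_ops)  # → [10, 100, 1000]
```

The operation count matches the batch size exactly: double the new records, double the work.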
| Input Size (n) | Approx. Operations |
|---|---|
| 10 records | Processes 10 records |
| 100 records | Processes 100 records |
| 1000 records | Processes 1000 records |
Pattern observation: The processing time grows roughly in direct proportion to the number of new records in each batch.
Time Complexity: O(n)
This means the time to process each batch grows linearly with the number of new records in that batch.
[X] Wrong: "The streaming job processes all data from the start every time, so time grows much faster."
[OK] Correct: Structured Streaming processes only new data in each micro-batch, not all past data again, so time grows with new data size, not total data size.
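The difference between the wrong model and the correct one can be made concrete with a small simulation (plain Python, illustrative batch sizes): the incremental model does work proportional to the new records, while the "reprocess everything" model does work proportional to all records seen so far.

```python
# Compare per-batch operation counts under two models of streaming.
batch_sizes = [10, 100, 1000]

incremental_ops = []   # correct model: O(new records) per batch
reprocess_ops = []     # wrong model: O(total records so far) per batch
total_seen = 0
for n in batch_sizes:
    total_seen += n
    incremental_ops.append(n)           # only the new data
    reprocess_ops.append(total_seen)    # everything from the start

print(incremental_ops)  # → [10, 100, 1000]
print(reprocess_ops)    # → [10, 110, 1110]
```

Under the incremental model, cost per batch depends only on that batch; under the reprocess-all model, every batch gets more expensive than the last even if the new data stays the same size.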
Understanding how streaming data processing time grows helps you design efficient real-time systems and shows you can reason about continuous data flows.
"What if we changed the trigger interval to process smaller batches more often? How would the time complexity per batch change?"
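One way to reason about this question: a shorter trigger interval splits the same stream into more, smaller batches. In PySpark this is configured with `writeStream.trigger(processingTime="1 second")`. The sketch below (plain Python, illustrative numbers) shows that the per-batch cost shrinks while the total work stays the same, so the complexity per batch is still O(n) in that batch's size:

```python
# Same 1000 records, two hypothetical trigger intervals.
total_records = 1000

large_batches = [500, 500]     # longer interval: fewer, bigger batches
small_batches = [100] * 10     # shorter interval: more, smaller batches

# Total work is identical either way...
same_total = sum(large_batches) == sum(small_batches) == total_records

# ...but the worst single batch is much cheaper with smaller batches.
print(same_total, max(large_batches), max(small_batches))
```

Smaller batches lower latency and peak per-batch cost, at the price of more scheduling overhead per unit of data.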