
Structured Streaming basics in Apache Spark - Time & Space Complexity

Time Complexity: Structured Streaming basics
O(n)
Understanding Time Complexity

We want to understand how the time it takes to process data grows as the amount of streaming data increases.

How does the processing time change when more data arrives in a stream?

Scenario Under Consideration

Analyze the time complexity of the following Structured Streaming code snippet.

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession for the streaming job
spark = SparkSession.builder.appName("streaming_example").getOrCreate()

# Read a stream of text lines from a TCP socket
streamingDF = spark.readStream.format("socket") \
  .option("host", "localhost") \
  .option("port", 9999) \
  .load()

# Write each micro-batch to the console as it arrives
query = streamingDF.writeStream.format("console").start()
query.awaitTermination()

This code reads streaming data from a socket and writes it to the console as it arrives.

Identify Repeating Operations
  • Primary operation: Processing each batch of streaming data as it arrives.
  • How many times: Once per micro-batch, repeating continuously as new data comes in.
How Execution Grows With Input

Each micro-batch processes all new data received since the last batch.

Input Size (n)    Approx. Operations
10 records        Processes 10 records
100 records       Processes 100 records
1000 records      Processes 1000 records

Pattern observation: The processing time grows roughly in direct proportion to the number of new records in each batch.
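The linear pattern above can be sketched with a plain-Python stand-in for a micro-batch; process_batch here is a hypothetical illustration of constant work per record, not Spark's API:

```python
# Hypothetical stand-in for one micro-batch: each record is touched once.
def process_batch(records):
    ops = 0
    for record in records:  # one pass over the new records
        ops += 1            # constant work per record
    return ops

# Operations grow in direct proportion to batch size.
for n in (10, 100, 1000):
    batch = list(range(n))
    print(n, process_batch(batch))  # ops == n for every batch size
```

Doubling the records in a batch doubles the operation count, which is exactly the O(n) behavior in the table above.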

Final Time Complexity

Time Complexity: O(n)

This means the time to process each batch grows linearly with the number of new records in that batch.

Common Mistake

[X] Wrong: "The streaming job processes all data from the start every time, so time grows much faster."

[OK] Correct: Structured Streaming processes only new data in each micro-batch, not all past data again, so time grows with new data size, not total data size.
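The gap between the wrong and correct mental models can be made concrete with two hypothetical work counters, assuming constant cost per record:

```python
def incremental_work(batch_sizes):
    # Structured Streaming model: each batch touches only its new records.
    return sum(batch_sizes)

def reprocess_all_work(batch_sizes):
    # Mistaken model: every batch re-reads all data seen so far.
    total, seen = 0, 0
    for n in batch_sizes:
        seen += n
        total += seen
    return total

batches = [100] * 10  # ten micro-batches of 100 records each
print(incremental_work(batches))    # 1000: grows linearly with new data
print(reprocess_all_work(batches))  # 5500: grows quadratically over batches
```

After ten batches the mistaken model has already done more than five times the work, and the gap widens with every additional batch.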

Interview Connect

Understanding how streaming data processing time grows helps you design efficient real-time systems and shows you can reason about continuous data flows.

Self-Check

"What if we changed the trigger interval to process smaller batches more often? How would the time complexity per batch change?"