Lazy Evaluation in Apache Spark - Time & Space Complexity
Lazy evaluation means Spark waits to run transformations until an action forces them. This affects how long your program takes to run.
We want to know how the time to run grows as the data size grows when using lazy evaluation.
Analyze the time complexity of the following code snippet.
```scala
val data = spark.read.textFile("data.txt")
val filtered = data.filter(line => line.contains("error"))
val mapped = filtered.map(line => line.toUpperCase())
mapped.count()
```
This code reads a file, filters lines with "error", converts them to uppercase, then counts them. Spark delays running until count() is called.
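Spark's deferral can be mimicked with plain Scala collections, no cluster needed. The sketch below (an analogy, not actual Spark internals) uses a lazy `view`: `filter` and `map` only describe work, and nothing touches the data until a terminal operation plays the role of the action. The `run` helper and its sample lines are made up for illustration.

```scala
object LazySketch {
  // Returns (number of matching lines, number of lines the pipeline touched).
  def run(lines: List[String]): (Int, Int) = {
    var touched = 0

    // Like Spark transformations, these build a plan but execute nothing yet.
    val pipeline = lines.view
      .filter { line => touched += 1; line.contains("error") }
      .map(_.toUpperCase)

    val touchedBeforeAction = touched
    val n = pipeline.size            // the "action": forces one pass over the data
    assert(touchedBeforeAction == 0) // nothing ran before the action

    (n, touched)
  }

  def main(args: Array[String]): Unit = {
    val (n, touched) = run(List("error: disk full", "ok", "error: timeout"))
    println(s"matching lines: $n, lines touched: $touched")
  }
}
```

The key observation carries over to Spark: building the pipeline is free, and the per-line cost is paid exactly once, when the action runs.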
Identify the loops, recursion, or traversals that repeat.
- Primary operation: Spark applies filter and map transformations on each line of the input data.
- How many times: Each line is processed once during the final action (count).
As the number of lines grows, Spark processes each line once when count() runs.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 lines filtered and mapped once |
| 100 | About 100 lines filtered and mapped once |
| 1000 | About 1000 lines filtered and mapped once |
Pattern observation: The work grows directly with the number of lines. More lines mean more work, but each line is handled only once.
Time Complexity: O(n)
This means the time to run grows in a straight line with the number of input lines.
[X] Wrong: "Spark runs all transformations immediately as they are written."
[OK] Correct: Spark waits for an action like count() before running all the steps together, so each line is processed once in a single pass rather than once per transformation.
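The difference between eager and lazy execution is visible in the order the data is touched. The sketch below (plain Scala views as a stand-in for Spark; the `trace` helper is invented for illustration) records when filter (F) and map (M) see each line: a strict collection runs each stage over the whole dataset in turn, while a lazy pipeline fuses the stages into one interleaved pass per element.

```scala
object FusedPass {
  // Record the order in which filter (F) and map (M) touch each line.
  def trace(lines: List[String], useView: Boolean): List[String] = {
    val order = scala.collection.mutable.ListBuffer.empty[String]
    def f(l: String) = { order += s"F:$l"; l.contains("error") }
    def m(l: String) = { order += s"M:$l"; l.toUpperCase }

    if (useView) lines.view.filter(f).map(m).toList // lazy: fused, one pass
    else lines.filter(f).map(m)                     // strict: stage by stage

    order.toList
  }

  def main(args: Array[String]): Unit = {
    val lines = List("error", "ok", "error2")
    println("strict: " + trace(lines, useView = false).mkString(", "))
    println("lazy:   " + trace(lines, useView = true).mkString(", "))
  }
}
```

In the strict run all the F entries come before any M entry; in the lazy run F and M alternate line by line, which is the single-pass behavior Spark's lazy evaluation gives you.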
Understanding lazy evaluation helps you explain how Spark handles big data efficiently. It shows you know how Spark saves time by delaying work until needed.
"What if we replaced count() with collect()? How would the time complexity change?"