Transformations vs actions in Apache Spark - Performance Comparison
When working with Apache Spark, it is important to understand how execution time grows with data size, and how transformations and actions each contribute to that growth.
We want to see how the number of operations scales as the input grows for these two types of operations.
Analyze the time complexity of the following code snippet.
```scala
// textFile on SparkContext returns an RDD[String]; reduceByKey is an RDD
// (pair-RDD) operation, so we read via sparkContext rather than spark.read
val data = spark.sparkContext.textFile("data.txt")
val words = data.flatMap(line => line.split(" "))               // split each line into words
val filtered = words.filter(word => word.length > 3)            // drop short words
val counts = filtered.map(word => (word, 1)).reduceByKey(_ + _) // per-word counts
val result = counts.collect()                                   // action: triggers the whole pipeline
```
This code reads a text file, splits each line into words, filters out words of three characters or fewer, counts occurrences of each remaining word, and collects the result to the driver.
Identify the repeated work: any loops, recursion, or traversals over the data.
- Primary operation: Each transformation (flatMap, filter, map, reduceByKey) processes all data elements once.
- How many times: Transformations are lazy; they only build up a lineage, and the whole chain runs exactly once, when the action (collect) is called.
As input size grows, each transformation touches every element once before the action triggers execution.
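This lazy, single-pass behavior can be sketched with plain Scala views, which defer work the same way Spark transformations do (a stand-in for illustration, not the Spark API itself):

```scala
object LazyDemo {
  def main(args: Array[String]): Unit = {
    var touches = 0 // counts how many times the map body actually runs

    // .view makes the chain lazy: nothing below executes yet
    val pipeline = (1 to 5).view
      .map { x => touches += 1; x * 2 }
      .filter(_ > 4)

    println(s"before terminal op: touches = $touches") // still 0: the chain is lazy

    // toList is the terminal operation, analogous to collect() in Spark
    val result = pipeline.toList

    println(s"after terminal op: touches = $touches, result = $result")
    // each element was mapped exactly once, even with two chained stages
  }
}
```

Just as in Spark, the chained stages do not each make a separate pass: the terminal operation pulls each element through the whole chain once.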
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 4 * 10 = 40 operations |
| 100 | About 4 * 100 = 400 operations |
| 1000 | About 4 * 1000 = 4000 operations |
Pattern observation: Operations grow linearly with input size, multiplied by the constant number of transformations (here, four).
Time Complexity: O(n)
This means the total work grows in direct proportion to the amount of data processed.
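The table above can be checked with a small counting model. The sketch below uses plain Scala collections to mimic the four chained stages and tallies how many element visits they perform; it is a simplified model of the Spark pipeline (one word per line, so every stage sees all n elements), not a benchmark:

```scala
object OpCount {
  // Counts how many element visits the four chained stages perform for input size n.
  def operations(n: Int): Int = {
    var ops = 0
    val lines = Seq.fill(n)("word")                            // n one-word lines
    val words = lines.flatMap { l => ops += 1; l.split(" ") }  // stage 1: flatMap
    val filtered = words.filter { w => ops += 1; w.length > 3 } // stage 2: filter
    val pairs = filtered.map { w => ops += 1; (w, 1) }         // stage 3: map
    pairs.foreach { _ => ops += 1 }                            // stage 4: reduce-like pass
    ops
  }

  def main(args: Array[String]): Unit =
    for (n <- Seq(10, 100, 1000))
      println(s"n = $n -> ${operations(n)} operations") // 40, 400, 4000
  }
```

The counter confirms the 4 * n pattern from the table: four constant-factor passes over n elements, which is still O(n).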
[X] Wrong: "Transformations run immediately and multiply the time cost each time they appear."
[OK] Correct: Transformations are lazy and only run once when an action triggers execution, so they do not multiply time cost by themselves.
Understanding how transformations and actions affect execution time helps you explain Spark job performance clearly and confidently.
"What if we replaced collect() with count()? How would the time complexity change?"