Why Spark Replaced MapReduce for Big Data: A Performance Analysis
We want to understand how processing time scales with data size in Spark versus MapReduce.
Which system handles growing data faster, and why?
Analyze the time complexity of this simplified Spark word-count code and compare it with its MapReduce equivalent.
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("WordCount").getOrCreate()
import spark.implicits._  // needed for flatMap/toDF on Dataset[String]

val data = spark.read.textFile("bigdata.txt")   // Dataset[String], one element per line
val words = data.flatMap(_.split(" ")).toDF("value")  // split every line into words
val wordCounts = words.groupBy("value").count() // group identical words and count them
wordCounts.show()
```
This code reads big data, splits lines into words, groups by word, and counts occurrences.
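The same logic can be sketched on a plain Scala collection, with no Spark involved. The `lines` sample below is a hypothetical stand-in for the file contents; it shows that each line is split once and each word is grouped once:

```scala
// Hypothetical in-memory stand-in for the contents of "bigdata.txt"
val lines = Seq("spark is fast", "spark keeps data in memory")

// Same pipeline as the Spark code: split lines into words, group by word, count
val wordCounts: Map[String, Int] =
  lines
    .flatMap(_.split(" "))   // one pass over every line
    .groupBy(identity)       // one pass over every word
    .map { case (word, occurrences) => (word, occurrences.size) }

// wordCounts("spark") == 2, wordCounts("memory") == 1
```

Every element of the input is touched a constant number of times, which is where the linear growth below comes from.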
Look at the main repeated steps:
- Primary operation: splitting lines into words and grouping words to count them.
- Frequency: once per line for splitting and once per word for grouping, across the entire dataset.
As the input grows, the number of words to process grows roughly in proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 lines split and grouped |
| 100 | About 100 lines split and grouped |
| 1000 | About 1000 lines split and grouped |
Pattern observation: Operations grow roughly linearly with input size.
Time Complexity: O(n)
This means the time to process data grows directly with the amount of data.
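The linear pattern in the table can be checked with a small counting sketch (plain Scala; the `countOperations` helper and its synthetic input are assumptions for illustration, not part of Spark):

```scala
// Count the primary operations (one split per line, one grouping touch per word)
// for a synthetic input of n identical lines with 3 words each.
def countOperations(n: Int): Int = {
  val lines = Seq.fill(n)("a b c")
  var ops = 0
  val words = lines.flatMap { line => ops += 1; line.split(" ") } // n split operations
  words.groupBy { w => ops += 1; w }                              // 3n grouping operations
  ops                                                             // total 4n, i.e. O(n)
}

// countOperations(10) == 40; countOperations(100) == 400 — tenfold input, tenfold work
```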
[X] Wrong: "MapReduce and Spark have the same speed because they do the same steps."
[OK] Correct: Spark keeps intermediate data in memory between steps, avoiding the repeated disk writes and reads that MapReduce performs between stages, so it stays faster as data grows.
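The in-memory advantage can be illustrated with a toy cost model (plain Scala; the simulation and its names are assumptions for illustration). It charges one disk read per record per stage when nothing is cached, versus a single initial load when data stays in memory:

```scala
// Toy simulation: run `stages` passes over `n` records, counting disk reads.
// `cached = true` models Spark keeping the data in memory after the first read;
// `cached = false` models MapReduce re-reading from disk at every stage.
def simulateDiskReads(n: Int, stages: Int, cached: Boolean): Int = {
  var reads = 0
  var inMemory = false
  for (_ <- 1 to stages) {
    if (!inMemory) {
      reads += n                   // read all n records from disk
      if (cached) inMemory = true  // keep them in memory for later stages
    }
  }
  reads
}

// simulateDiskReads(1000, 5, cached = false) == 5000  (MapReduce-style)
// simulateDiskReads(1000, 5, cached = true)  == 1000  (Spark-style)
```

Both variants are still O(n) per stage; caching shrinks the constant factor, which is exactly the distinction the correct answer above draws.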
Both systems scale as O(n); Spark's advantage lies in the per-item cost, since keeping data in memory avoids slow disk I/O between steps. Understanding this distinction helps you explain why Spark is preferred for big data tasks and shows your grasp of efficient data processing.
"What if Spark did not keep data in memory and wrote to disk after every step? How would the time complexity change?"