Apache Spark vs Hadoop MapReduce - Performance Comparison
We want to understand how the time to process data grows when using Spark compared to Hadoop MapReduce.
Which system handles larger datasets faster, and why?
Analyze the time complexity of these simplified Spark and Hadoop MapReduce code snippets.
```scala
// Spark example: in-memory data processing (runs as-is in spark-shell, where `spark` is predefined)
val data = spark.read.textFile("data.txt")         // read the lines as a Dataset[String]
val words = data.flatMap(line => line.split(" "))  // split each line into words
val wordCounts = words.groupBy("value").count()    // count occurrences of each word
wordCounts.show()                                  // action: triggers the whole in-memory pipeline
```
```scala
// Hadoop MapReduce example: disk-based processing (pseudocode; a fuller sketch follows below)
// Mapper reads lines and emits (word, 1) pairs, which are written to local disk before the shuffle
// Reducer reads the shuffled pairs back from disk and sums the counts per word
```
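To make the Hadoop side concrete, here is a sketch of the classic word-count job written against the `org.apache.hadoop.mapreduce` API. It is shown in Scala to match the Spark snippet above; the class names and the input/output paths taken from `args` are assumptions of this example, not part of the original comparison.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Mapper: reads each input line and emits (word, 1) pairs.
// The intermediate pairs are sorted and spilled to local disk before the shuffle.
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    value.toString.split(" ").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      context.write(word, one)
    }
  }
}

// Reducer: reads the shuffled pairs back and sums the counts for each word.
class IntSumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  private val result = new IntWritable()

  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    result.set(sum)
    context.write(key, result)
  }
}

// Driver: wires the mapper and reducer into a job; input/output paths come from args(0)/args(1).
object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(classOf[TokenizerMapper])
    job.setMapperClass(classOf[TokenizerMapper])
    job.setCombinerClass(classOf[IntSumReducer])
    job.setReducerClass(classOf[IntSumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```

Notice that the mapper's output is written to local disk, shuffled, and then read back by the reducer; that extra disk round trip is exactly the overhead the rest of this section talks about.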
Spark processes data in memory with transformations, while Hadoop MapReduce reads and writes to disk between steps.
Consider the main repeated work each system performs (a short sketch follows this list):
- Primary operation: reading and processing each data record (line or word).
- How many times: once per record in Spark, where intermediate results stay in memory; in Hadoop MapReduce each record is additionally written to and read back from disk between the map and reduce phases.
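A minimal sketch of the in-memory behavior, assuming the same `data.txt` input as above; the explicit `cache()` call and the second aggregation are additions for illustration only:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("wordcount-cache").getOrCreate()
import spark.implicits._

val data  = spark.read.textFile("data.txt")               // one pass over the input file
val words = data.flatMap(line => line.split(" ")).cache() // keep the split words in memory

// Both actions below reuse the cached, in-memory dataset instead of writing it out
// and reading it back, as MapReduce would do between its map and reduce stages.
val wordCounts = words.groupBy("value").count()
wordCounts.show()
println(s"total words: ${words.count()}")
```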
As the data size grows, Spark keeps intermediate data in memory, so its work scales roughly linearly with the number of records; Hadoop MapReduce also scales linearly, but adds a disk write and a disk read for every record.
| Input Size (n) | Approx. Operations (Spark) | Approx. Operations (Hadoop MapReduce) |
|---|---|---|
| 10 | 10 reads + processing | 10 reads + disk writes + disk reads + processing |
| 100 | 100 reads + processing | 100 reads + disk writes + disk reads + processing |
| 1000 | 1000 reads + processing | 1000 reads + disk writes + disk reads + processing |
Pattern observation: both columns grow linearly with input size, but Hadoop adds a disk write and a disk read per record, so Spark's cost rises with a much smaller constant factor.
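If you want to check this pattern yourself, one rough approach is to time the same Spark word count over inputs of increasing size. This is only an illustrative sketch, not a proper benchmark; the file names `data_10.txt`, `data_100.txt`, and `data_1000.txt` and the use of simple wall-clock timing are assumptions of the example:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("wordcount-scaling").getOrCreate()
import spark.implicits._

// Hypothetical input files with roughly 10, 100, and 1000 records each.
val inputs = Seq("data_10.txt", "data_100.txt", "data_1000.txt")

for (path <- inputs) {
  val start = System.nanoTime()

  val distinctWords = spark.read.textFile(path)
    .flatMap(_.split(" "))
    .groupBy("value")
    .count()   // per-word counts (still lazy)
    .count()   // action: forces the whole pipeline to run

  val elapsedMs = (System.nanoTime() - start) / 1e6
  println(f"$path%-15s distinct words: $distinctWords%6d  time: $elapsedMs%.1f ms")
}
```

For small inputs the fixed job-startup overhead will dominate, so the roughly linear growth only becomes visible once the inputs are large enough.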
Time Complexity: O(n)
This means both systems touch each data item a fixed number of times, so both are O(n); Spark simply avoids the extra per-record disk steps, which is why it is faster in practice even though the big-O class is the same.
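One way to make "same O(n), different constants" concrete is a rough per-record cost model; the cost symbols below are illustrative labels, not measured values:

$$
T_{\text{Spark}}(n) \approx (c_{\text{read}} + c_{\text{process}})\,n
\qquad
T_{\text{Hadoop}}(n) \approx (c_{\text{read}} + c_{\text{process}} + c_{\text{disk write}} + c_{\text{disk read}})\,n
$$

Both expressions are linear in n; the difference is the per-record constant, and disk accesses are typically far slower than in-memory operations.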
[X] Wrong: "Hadoop MapReduce and Spark have the same speed because they both process all data."
[OK] Correct: Hadoop writes intermediate results to disk, adding extra time, while Spark keeps data in memory, reducing repeated work.
Understanding how data processing systems handle large data helps you explain performance differences clearly and shows your grasp of practical data science tools.
"What if Spark had to write intermediate results to disk like Hadoop? How would the time complexity change?"