Apache Spark · ~10 mins

Spark vs Hadoop MapReduce - A Visual Side-by-Side Comparison

Concept Flow - Spark vs Hadoop MapReduce
Input Data
Map Phase
Shuffle & Sort
Reduce Phase
Write Output
Final Result
Shows the flow of data processing in Hadoop MapReduce versus Apache Spark, highlighting disk-based steps versus in-memory operations.
Execution Sample
Apache Spark
data = sc.textFile('data.txt')                     # step 1: load file from disk into an RDD
words = data.flatMap(lambda line: line.split())    # step 2: lazy - split each line into words
wordCounts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)  # steps 3-4: lazy - count per word
wordCounts.collect()                               # step 5: action - triggers execution, returns results
Counts words in a text file using Spark's in-memory transformations and actions.
Execution Table
Step | Operation | Data Location | Action | Result
1 | Read data from 'data.txt' | Disk | Load text file into RDD | RDD with lines of text
2 | flatMap split lines | Memory | Split each line into words | RDD with words
3 | map to (word, 1) | Memory | Create pairs for counting | RDD of (word, 1) tuples
4 | reduceByKey sum counts | Memory | Sum counts for each word | RDD of (word, total_count)
5 | collect results | Memory to Driver | Bring results to driver program | List of (word, count) pairs
💡 All transformations are lazy; action 'collect' triggers execution and returns final counts.
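The five table steps can be mirrored in plain Python (a stand-in for the Spark pipeline, with an inline list replacing the 'data.txt' file so the sketch is self-contained):

```python
from collections import Counter

# Stand-in for 'data.txt'; in Spark, step 1 would read this from disk into an RDD.
lines = ["spark is fast", "spark keeps data in memory"]

# Step 2 (flatMap): split each line into words.
words = [w for line in lines for w in line.split()]

# Steps 3-4 (map to (word, 1) + reduceByKey): Counter does both at once here.
word_counts = Counter(words)

# Step 5 (collect): materialize results as (word, count) pairs.
print(sorted(word_counts.items()))
```

Unlike Spark, every line here runs eagerly; in Spark only `collect` would trigger the computation.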
Variable Tracker
Variable | Start | After Step 1 | After Step 2 | After Step 3 | After Step 4 | Final
data | empty | RDD(lines of text) | RDD(lines of text) | RDD(lines of text) | RDD(lines of text) | RDD(lines of text)
words | undefined | undefined | RDD(words) | RDD(words) | RDD(words) | RDD(words)
wordCounts | undefined | undefined | undefined | RDD((word, 1)) | RDD((word, total_count)) | RDD((word, total_count))
result | undefined | undefined | undefined | undefined | undefined | List((word, count))
Key Moments - 3 Insights
Why does Spark run faster than Hadoop MapReduce?
Spark keeps data in memory during processing (see execution_table steps 2-4), avoiding slow disk reads/writes after each step, unlike Hadoop MapReduce which writes intermediate results to disk.
What triggers the actual computation in Spark?
Transformations like flatMap and map are lazy and do not run immediately. The action 'collect' (step 5 in execution_table) triggers all previous steps to execute.
Why does Hadoop MapReduce have shuffle and sort between map and reduce?
Hadoop writes map output to disk and must sort and group it by key so that each reducer receives all the values for a given key. This grouping costs extra disk I/O and network transfer, latency that Spark largely avoids by keeping intermediate data in memory.
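The shuffle's job of grouping map output by key can be sketched in a few lines of plain Python (an in-memory stand-in; real Hadoop does this via disk spills and network transfer between mappers and reducers):

```python
from collections import defaultdict

# Map phase output: (word, 1) pairs, as a mapper would emit them.
pairs = [("spark", 1), ("fast", 1), ("spark", 1)]

# Shuffle & sort: group all values for the same key together,
# so each reducer sees one key with all of its values.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# Reduce phase: sum the grouped values for each key.
result = {key: sum(values) for key, values in sorted(groups.items())}
print(result)  # {'fast': 1, 'spark': 2}
```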
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution_table, at which step does Spark actually perform the word-count aggregation?
AStep 2: flatMap split lines
BStep 4: reduceByKey sum counts
CStep 3: map to (word,1)
DStep 5: collect results
💡 Hint
Check the 'Action' column in execution_table for where counts are summed.
According to variable_tracker, what is the state of 'words' after Step 3?
ARDD((word,1))
BRDD(lines of text)
CRDD(words)
DList((word, count))
💡 Hint
Look at the 'words' row and the 'After Step 3' column in variable_tracker.
If we remove the 'collect' action, what happens to the execution according to the exit_note?
ANo computation runs, transformations are lazy
BOnly map runs, reduce does not
CAll transformations run immediately
DData is written to disk
💡 Hint
Refer to the exit_note about lazy transformations and action triggering.
Concept Snapshot
Spark vs Hadoop MapReduce:
- Hadoop MapReduce reads/writes intermediate data to disk.
- Spark keeps data in memory for faster processing.
- Spark uses lazy transformations; actions trigger execution.
- MapReduce has map, shuffle/sort, reduce phases.
- Spark uses RDD/DataFrame transformations and actions.
- Spark is generally faster for iterative and interactive tasks.
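The lazy-transformation/action split in the snapshot above can be illustrated with a toy class (a hypothetical `ToyRDD`, not the real Spark API): transformations only record work to do, and `collect()` is what finally runs the chain.

```python
class ToyRDD:
    """Toy in-memory stand-in for an RDD: transformations are lazy."""

    def __init__(self, compute):
        self._compute = compute  # zero-arg function that produces the data on demand

    def flatMap(self, f):
        # Records the work; nothing runs until collect() is called.
        return ToyRDD(lambda: [y for x in self._compute() for y in f(x)])

    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self._compute()])

    def reduceByKey(self, f):
        def run():
            acc = {}
            for k, v in self._compute():
                acc[k] = f(acc[k], v) if k in acc else v
            return list(acc.items())
        return ToyRDD(run)

    def collect(self):
        # The action: triggers the whole recorded chain.
        return self._compute()

data = ToyRDD(lambda: ["spark is fast", "so is spark"])
counts = data.flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
# Up to this point no splitting or counting has happened; collect() runs it all.
print(sorted(counts.collect()))
```

Deleting the final `collect()` call leaves `counts` as an unexecuted plan, which is exactly the behavior the quiz's last question probes.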
Full Transcript
This visual execution compares Apache Spark and Hadoop MapReduce data processing flows. Hadoop MapReduce reads input from disk, maps data, shuffles and sorts intermediate results on disk, then reduces and writes output. Spark reads data into memory as RDDs, applies transformations like flatMap and map lazily, then performs reduceByKey in memory. The action 'collect' triggers execution and returns results. Variables like 'data', 'words', and 'wordCounts' change state step-by-step in memory. Key points include Spark's speed advantage due to in-memory processing and lazy evaluation. The quiz tests understanding of when aggregation happens, variable states, and the role of actions in Spark.