Concept Flow - Spark vs Hadoop MapReduce
Input Data → Map Phase → Shuffle & Sort → Reduce Phase → Write Output → Final Result
Shows the flow of data processing in Hadoop MapReduce versus Apache Spark, highlighting disk-based steps versus in-memory operations.
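The MapReduce pipeline above can be sketched in plain Python. This is a toy illustration, not Hadoop code: in real Hadoop each phase reads from and writes to disk (HDFS), whereas here in-memory lists stand in for those intermediate files.

```python
from itertools import groupby
from operator import itemgetter

# Input data: two lines of text stand in for a file on HDFS
lines = ["to be or not to be", "to be is to do"]

# Map phase: emit a (word, 1) pair for every word in every line
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle & sort phase: sort pairs by key and group them, so each
# key's pairs end up together (Hadoop routes each key to one reducer)
mapped.sort(key=itemgetter(0))
grouped = {key: [v for _, v in pairs]
           for key, pairs in groupby(mapped, key=itemgetter(0))}

# Reduce phase: sum the counts for each word
counts = {word: sum(ones) for word, ones in grouped.items()}

# Write output / final result
print(counts)  # e.g. {'be': 3, 'to': 4, ...}
```

Each dictionary or list here corresponds to data that Hadoop would materialize on disk between phases, which is exactly the overhead Spark avoids.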
```python
# Word count in PySpark; `sc` is the SparkContext (created automatically
# in the pyspark shell)
data = sc.textFile('data.txt')
words = data.flatMap(lambda line: line.split())
wordCounts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
result = wordCounts.collect()
```
| Step | Operation | Data Location | Action | Result |
|---|---|---|---|---|
| 1 | Read data from 'data.txt' | Disk | Load text file into RDD | RDD with lines of text |
| 2 | flatMap split lines | Memory | Split each line into words | RDD with words |
| 3 | map to (word,1) | Memory | Create pairs for counting | RDD of (word,1) tuples |
| 4 | reduceByKey sum counts | Memory | Sum counts for each word | RDD of (word, total_count) |
| 5 | collect results | Memory to Driver | Bring results to driver program | List of (word, count) pairs |
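The five steps above can be mirrored in pure Python so each intermediate value is easy to inspect without a Spark installation. This is only a stand-in sketch: real PySpark RDDs are distributed across a cluster and evaluated lazily, while these lists are local and eager.

```python
from collections import defaultdict

# Step 1: "read" the file -- a list of lines stands in for the RDD
data = ["spark is fast", "spark keeps data in memory"]

# Step 2: flatMap -- split each line into individual words
words = [w for line in data for w in line.split()]

# Step 3: map -- pair each word with an initial count of 1
pairs = [(w, 1) for w in words]

# Step 4: reduceByKey -- sum the counts per word
totals = defaultdict(int)
for w, n in pairs:
    totals[w] += n

# Step 5: collect -- materialize the results for the "driver"
result = sorted(totals.items())
print(result)
```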
| Variable | Start | After Step 1 | After Step 2 | After Step 3 | After Step 4 | Final |
|---|---|---|---|---|---|---|
| data | undefined | RDD(lines of text) | RDD(lines of text) | RDD(lines of text) | RDD(lines of text) | RDD(lines of text) |
| words | undefined | undefined | RDD(words) | RDD(words) | RDD(words) | RDD(words) |
| wordCounts | undefined | undefined | undefined | RDD((word,1)) | RDD((word, total_count)) | RDD((word, total_count)) |
| result | undefined | undefined | undefined | undefined | undefined | List((word, count)) |
Spark vs Hadoop MapReduce:
- Hadoop MapReduce reads/writes intermediate data to disk between phases.
- Spark keeps intermediate data in memory for faster processing.
- Spark uses lazy transformations; actions trigger execution.
- MapReduce has fixed map, shuffle/sort, and reduce phases.
- Spark uses RDD/DataFrame transformations and actions.
- Spark is generally faster for iterative and interactive tasks.
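The "lazy transformations; actions trigger execution" point can be illustrated with a rough Python-generator analogy (this is an analogy only, not Spark's actual mechanism): building the pipeline does no work, and only consuming it, the "action", pulls data through.

```python
# Analogy for lazy evaluation: generators defer all work until consumed.
log = []

def read_lines():
    # Stand-in for a data source; logs each line actually read
    for line in ["a b", "b c"]:
        log.append("read")
        yield line

# "Transformations": define the pipeline -- nothing runs yet
words = (w for line in read_lines() for w in line.split())
pairs = ((w, 1) for w in words)
assert log == []  # no data has been read

# "Action": consuming the pipeline triggers the whole chain
collected = list(pairs)
assert log == ["read", "read"]
print(collected)  # [('a', 1), ('b', 1), ('b', 1), ('c', 1)]
```

Spark takes the same idea further: it records transformations as a lineage graph and only schedules cluster work when an action such as `collect()` or `count()` is called.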