MapReduce vs Spark in Hadoop: Key Differences and Usage
MapReduce is a batch-processing framework that writes intermediate data to disk, making it slower but reliable for large jobs. Spark is an in-memory processing engine that runs faster by keeping data in memory, and it supports real-time and iterative tasks.
Quick Comparison
Here is a quick side-by-side comparison of MapReduce and Spark in Hadoop:
| Factor | MapReduce | Spark |
|---|---|---|
| Processing Type | Batch processing | Batch and real-time processing |
| Speed | Slower due to disk I/O | Faster with in-memory computing |
| Ease of Use | Complex, verbose code | Simpler APIs with multiple languages |
| Fault Tolerance | High, via HDFS replication and task re-execution | High, via RDD lineage |
| Memory Usage | Low, uses disk storage | High, uses RAM extensively |
| Use Cases | Large batch jobs | Iterative algorithms, streaming |
Key Differences
MapReduce works by splitting data into chunks, processing them in map and reduce steps, and writing intermediate results to disk. This makes it reliable but slower because of the heavy disk input/output operations.
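The split → map → shuffle → reduce flow described above can be sketched in plain Python. This is only an illustration of the programming model, not the Hadoop API; in a real cluster the shuffle step writes intermediate pairs to disk and moves them between nodes:

```python
# Minimal sketch of the MapReduce model (illustrative only; real MapReduce
# spills intermediate (key, value) pairs to disk between phases).
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every token in the input
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle/sort: group all values by key before reducing
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["spark is fast", "mapreduce is reliable"]
result = reduce_phase(shuffle(map_phase(lines)))
print(result)  # {'spark': 1, 'is': 2, 'fast': 1, 'mapreduce': 1, 'reliable': 1}
```

Each phase consumes the full output of the previous one, which is why a multi-stage MapReduce job pays the disk I/O cost between every stage.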
Spark improves speed by keeping data in memory (RAM) during processing, which reduces disk reads and writes. It supports batch, streaming, and iterative tasks, making it more versatile.
MapReduce requires verbose code and is written mostly in Java, while Spark offers easy-to-use APIs in Python, Scala, Java, and R. Both handle failures well, but Spark uses Resilient Distributed Datasets (RDDs), which record the lineage of transformations that produced them so lost partitions can be recomputed efficiently rather than restored from replicas.
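The lineage idea behind RDDs can be sketched in a few lines of plain Python. This is a toy illustration, not the real Spark RDD API: each object remembers its parent and the transformation applied, so a "lost" result can always be rebuilt by replaying the chain:

```python
# Illustrative sketch of lineage-based recovery (not the real RDD API).
class SketchRDD:
    def __init__(self, source, transform=None):
        self.source = source        # parent data or parent SketchRDD
        self.transform = transform  # transformation recorded in the lineage

    def map(self, fn):
        # Transformations are lazy: we only record the step, nothing runs yet
        return SketchRDD(self, fn)

    def compute(self):
        # Walk the lineage chain and recompute from the original source;
        # since nothing is persisted, losing a result costs only recomputation
        if self.transform is None:
            return list(self.source)
        return [self.transform(x) for x in self.source.compute()]

base = SketchRDD([1, 2, 3])
squared = base.map(lambda x: x * x)
print(squared.compute())  # [1, 4, 9]
# If the computed partition is lost, calling compute() again
# rebuilds it from lineage - no disk-based replica is needed.
```

Real Spark applies the same principle per partition and at cluster scale, which is how it keeps fault tolerance high without MapReduce's heavy disk writes.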
MapReduce Code Example
```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] tokens = value.toString().split("\\s+");
            for (String token : tokens) {
                word.set(token);
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
Spark Equivalent
```python
from pyspark import SparkContext

sc = SparkContext('local', 'WordCount')

text_file = sc.textFile('input.txt')
counts = text_file.flatMap(lambda line: line.split()) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.collect()
```
When to Use Which
Choose MapReduce when you have very large batch jobs that can tolerate slower processing and benefit from disk-based, fault-tolerant storage. It remains a suitable choice for legacy Hadoop environments.
Choose Spark when you need faster processing, real-time streaming, or iterative algorithms like machine learning. Spark is better for interactive data analysis and supports multiple programming languages for easier development.