
MapReduce vs Spark in Hadoop: Key Differences and Usage

In Hadoop, MapReduce is a batch processing framework that writes intermediate data to disk, making it slower but reliable for large jobs. Spark is an in-memory processing engine that runs faster by keeping data in memory and supports real-time and iterative tasks.

Quick Comparison

Here is a quick side-by-side comparison of MapReduce and Spark in Hadoop:

| Factor | MapReduce | Spark |
|---|---|---|
| Processing Type | Batch processing | Batch and real-time processing |
| Speed | Slower due to disk I/O | Faster with in-memory computing |
| Ease of Use | Complex, verbose code | Simpler APIs in multiple languages |
| Fault Tolerance | High, via data replication | High, via lineage and RDDs |
| Memory Usage | Low, uses disk storage | High, uses RAM extensively |
| Use Cases | Large batch jobs | Iterative algorithms, streaming |

Key Differences

MapReduce works by splitting data into chunks, processing them in map and reduce steps, and writing intermediate results to disk. This makes it reliable but slower because of the heavy disk input/output operations.
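To make the map → shuffle → reduce flow concrete, here is a minimal pure-Python sketch of the paradigm. No Hadoop is involved; the function names and sample input are illustrative only:

```python
from collections import defaultdict

def map_phase(lines):
    """Map step: emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle step: group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["spark is fast", "hadoop is reliable", "spark is versatile"]
result = reduce_phase(shuffle(map_phase(lines)))
print(result)
# {'spark': 2, 'is': 3, 'fast': 1, 'hadoop': 1, 'reliable': 1, 'versatile': 1}
```

In real MapReduce, the output of `map_phase` would be spilled to disk and the shuffle would move data across the network between nodes; that disk and network traffic is exactly where the slowdown comes from.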

Spark improves speed by keeping data in memory (RAM) during processing, which reduces disk reads and writes. It supports batch, streaming, and iterative tasks, making it more versatile.

While MapReduce requires writing more code and is mostly limited to Java, Spark offers easy-to-use APIs in Python, Scala, Java, and R. Both handle failures well, but Spark relies on Resilient Distributed Datasets (RDDs), which record the lineage of transformations that produced them so lost data can be recomputed efficiently instead of being replicated.
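The lineage idea behind RDDs can be illustrated with a toy pure-Python sketch. The class and data below are hypothetical, not the Spark API: the point is that each dataset remembers the chain of transformations that produced it, so a lost partition can be rebuilt from the source instead of being kept in replicated copies.

```python
class LineageDataset:
    """Toy dataset that records its transformation chain (its 'lineage')
    so any lost partition can be recomputed from the source data."""

    def __init__(self, partitions, lineage=()):
        self.partitions = partitions   # list of lists (the data, split into partitions)
        self.lineage = lineage         # tuple of functions applied so far

    def map(self, fn):
        new_parts = [[fn(x) for x in part] for part in self.partitions]
        return LineageDataset(new_parts, self.lineage + (fn,))

    def recompute_partition(self, source_partition):
        """Rebuild one lost partition by replaying the lineage over the source."""
        data = source_partition
        for fn in self.lineage:
            data = [fn(x) for x in data]
        return data

source = [[1, 2], [3, 4]]                             # two partitions of raw input
ds = LineageDataset(source).map(lambda x: x * 10).map(lambda x: x + 1)

ds.partitions[1] = None                               # simulate losing a partition
ds.partitions[1] = ds.recompute_partition(source[1])  # recover it from lineage
print(ds.partitions)  # [[11, 21], [31, 41]]
```

Real Spark lineage also covers wide transformations like `reduceByKey`, where recovery may need data from several upstream partitions, but the recompute-from-lineage principle is the same.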


MapReduce Code Example

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: emits a (word, 1) pair for every whitespace-separated token
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      String[] tokens = value.toString().split("\\s+");
      for (String token : tokens) {
        word.set(token);
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts collected for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combiner pre-aggregates map output locally
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Output:

```
word1 3
word2 5
word3 2
...
```

Spark Equivalent

```python
from pyspark import SparkContext

# Local SparkContext for the example
sc = SparkContext('local', 'WordCount')

text_file = sc.textFile('input.txt')
counts = (text_file.flatMap(lambda line: line.split())  # split each line into words
                   .map(lambda word: (word, 1))         # pair each word with a count of 1
                   .reduceByKey(lambda a, b: a + b))    # sum the counts per word
print(counts.collect())
```

Output:

```
[('word1', 3), ('word2', 5), ('word3', 2), ...]
```

When to Use Which

Choose MapReduce when you have very large batch jobs that can tolerate slower processing and require strong fault tolerance with disk-based storage. It is suitable for legacy Hadoop environments.

Choose Spark when you need faster processing, real-time streaming, or iterative algorithms like machine learning. Spark is better for interactive data analysis and supports multiple programming languages for easier development.

Key Takeaways

- Spark is faster than MapReduce because it processes data in memory instead of writing intermediate results to disk.
- MapReduce is reliable for large batch jobs but requires more complex code and runs slower.
- Spark supports real-time and iterative processing, making it more versatile for modern big data tasks.
- Use MapReduce for legacy Hadoop batch jobs and Spark for faster, interactive, or streaming workloads.
- Spark offers simpler APIs in multiple languages, improving developer productivity.