Hadoop vs Spark: Key Differences and When to Use Each
Hadoop is a framework for distributed storage and batch processing built on MapReduce, while Spark is a fast, in-memory data processing engine that supports both batch and real-time analytics. Spark is generally faster and easier to use than Hadoop MapReduce, but Hadoop excels at large-scale storage through its HDFS file system.
Quick Comparison
Here is a quick side-by-side comparison of Hadoop and Spark on key factors.
| Factor | Hadoop | Spark |
|---|---|---|
| Processing Model | Batch processing with MapReduce | In-memory batch and stream processing |
| Speed | Slower due to disk I/O | Faster due to in-memory computation |
| Ease of Use | Complex, requires Java coding | Simpler APIs in Scala, Python, Java, R |
| Storage | Uses HDFS for distributed storage | Can use HDFS or other storage systems |
| Fault Tolerance | High, via data replication in HDFS | High, via RDD lineage and data replication |
| Use Cases | Large-scale batch jobs, ETL | Real-time analytics, iterative algorithms |
Key Differences
Hadoop uses the MapReduce programming model that writes intermediate data to disk, making it slower but reliable for batch jobs. It relies on HDFS for distributed storage, which replicates data across nodes for fault tolerance.
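To make the MapReduce model concrete, here is a minimal plain-Python sketch of its three phases (map, shuffle/group, reduce) applied to word counting. It illustrates only the dataflow, not Hadoop's distributed, disk-backed execution.

```python
from collections import defaultdict

def map_phase(lines):
    # Emit (word, 1) pairs, analogous to a Mapper
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Group values by key; Hadoop performs this between map and reduce,
    # writing intermediate data to disk
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts per word, analogous to a Reducer
    return {key: sum(values) for key, values in groups.items()}

lines = ["to be or not to be"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Each phase here runs in one process; in Hadoop, map and reduce tasks run on different nodes, which is why the shuffle step (and its disk I/O) dominates job latency.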
Spark improves speed by keeping data in memory during processing, which is ideal for iterative tasks like machine learning and real-time analytics. It supports multiple languages and provides higher-level APIs, making it easier to write complex workflows.
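The lineage idea behind Spark's fault tolerance can be sketched in a few lines of plain Python: each dataset remembers its parent and the transformation that produced it, so a lost partition can be recomputed from the source rather than restored from a replica. This is a toy illustration of the concept, not Spark's actual RDD implementation.

```python
class MiniRDD:
    """Toy RDD: records lineage (parent + transformation), not results."""
    def __init__(self, data=None, parent=None, transform=None):
        self.data = data          # only the source RDD holds data
        self.parent = parent
        self.transform = transform

    def map(self, fn):
        return MiniRDD(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return MiniRDD(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # Recompute by replaying the lineage chain from the source;
        # Spark does the same per lost partition after a failure
        if self.parent is None:
            return self.data
        return self.transform(self.parent.collect())

source = MiniRDD(data=[1, 2, 3, 4, 5])
result = source.map(lambda x: x * x).filter(lambda x: x > 5)
print(result.collect())  # [9, 16, 25]
```

Because transformations are recorded rather than executed eagerly, Spark can also defer work until an action like `collect()` is called, which is the basis of its lazy evaluation.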
While Hadoop is great for storing massive datasets and running long batch jobs, Spark is preferred when speed and ease of development are priorities, especially for streaming data and interactive queries.
Code Comparison
Here is a simple example of counting words in a text file using Hadoop MapReduce in Java.
```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
Spark Equivalent
Here is the equivalent word count example using Apache Spark with Python (PySpark).
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('WordCount').getOrCreate()
sc = spark.sparkContext

text_file = sc.textFile('input.txt')
counts = text_file.flatMap(lambda line: line.split()) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)

for word, count in counts.collect():
    print(f'{word}\t{count}')

spark.stop()
```
When to Use Which
Choose Hadoop when you need reliable, large-scale storage with batch processing of massive datasets and your jobs are not time-sensitive. Hadoop's HDFS is excellent for storing huge amounts of data across many machines.
Choose Spark when you want faster processing, especially for iterative algorithms, real-time data streams, or interactive data analysis. Spark's in-memory computing and easy-to-use APIs speed up development and execution.
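Spark's classic streaming model treats a live stream as a sequence of small batches and applies the same batch logic to each one. The micro-batch idea can be sketched in plain Python (an illustration only; Spark's DStream and Structured Streaming machinery handle scheduling, state, and fault tolerance for you):

```python
from collections import Counter

def micro_batches(stream, batch_size):
    # Discretize an incoming stream into small batches, the way
    # Spark Streaming discretizes input into micro-batches
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

running = Counter()  # stateful aggregation carried across batches
stream = ["a", "b", "a", "c", "b", "a", "d"]
for batch in micro_batches(stream, batch_size=3):
    running.update(batch)  # same word-count logic, applied per batch
print(running.most_common(2))  # [('a', 3), ('b', 2)]
```

Because each micro-batch reuses the ordinary batch code path, the same word-count logic from the PySpark example above works on streaming input with little modification.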