Hadoop · Comparison · Beginner · 4 min read

Hadoop vs Spark: Key Differences and When to Use Each

The main difference between Hadoop and Spark is that Hadoop's MapReduce engine reads and writes data to disk between processing steps, while Spark keeps data in memory for faster performance. Spark is better suited to real-time and iterative workloads, whereas Hadoop excels at large-scale batch processing.

Quick Comparison

Here is a quick side-by-side comparison of Hadoop and Spark on key factors.

| Factor | Hadoop | Spark |
| --- | --- | --- |
| Processing Model | Batch processing using MapReduce | In-memory processing with a DAG engine |
| Speed | Slower due to disk I/O | Faster due to in-memory computation |
| Ease of Use | Complex MapReduce programming | Simpler APIs in Java, Scala, and Python |
| Data Handling | Processes data stored on HDFS | Processes data from HDFS and other sources |
| Real-time Processing | Not suitable | Supports real-time and streaming workloads |
| Fault Tolerance | High, via data replication | High, via lineage and RDDs |

Key Differences

Hadoop is a framework that stores data on a distributed file system called HDFS and processes it using MapReduce, which reads and writes data to disk between each step. This makes it reliable but slower, especially for tasks that need multiple passes over data.
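The map-shuffle-reduce cycle described above can be sketched in plain Python. This is only a conceptual illustration with made-up input lines; real MapReduce distributes each phase across a cluster and spills the intermediate pairs to disk between steps, which is exactly where the slowdown comes from.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every token in every line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key. In real MapReduce this step
    # writes intermediate data to disk and moves it between nodes.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the ones collected for each word.
    return {word: sum(ones) for word, ones in groups.items()}

lines = ["spark is fast", "hadoop is reliable"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'spark': 1, 'is': 2, 'fast': 1, 'hadoop': 1, 'reliable': 1}
```

The same three phases appear, with much more boilerplate, in the Java MapReduce example later in this article.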

Spark, on the other hand, keeps data in memory as Resilient Distributed Datasets (RDDs) during processing. This reduces disk I/O and speeds up computations, making it ideal for iterative algorithms and real-time analytics.

While Hadoop requires writing relatively verbose MapReduce code, Spark offers easy-to-use APIs in several languages and ships with libraries for SQL, machine learning, and graph processing. The two frameworks also handle fault tolerance differently: Hadoop replicates data blocks across nodes, while Spark rebuilds lost partitions from lineage information.
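The lineage idea can be sketched in plain Python: instead of replicating the data itself, each dataset remembers the transformations that produced it, so a lost result can simply be recomputed from the source. The TinyRDD class below is a toy stand-in, not the real RDD API; actual lineage tracks partitions and dependencies across a cluster.

```python
class TinyRDD:
    """Toy stand-in for an RDD: stores a recipe (lineage), not replicated data."""
    def __init__(self, source, transforms=None):
        self.source = source                 # the original input (e.g. lines from HDFS)
        self.transforms = transforms or []   # lineage: ordered list of transformations

    def map(self, fn):
        # Record the transformation lazily instead of applying it eagerly.
        return TinyRDD(self.source,
                       self.transforms + [lambda data: [fn(x) for x in data]])

    def compute(self):
        # Replay the lineage from the source. This is also how a lost
        # partition would be rebuilt after a node failure.
        data = self.source
        for transform in self.transforms:
            data = transform(data)
        return data

numbers = TinyRDD([1, 2, 3])
doubled = numbers.map(lambda x: x * 2).map(lambda x: x + 1)
print(doubled.compute())  # [3, 5, 7]
# If the computed result were lost, compute() could rebuild it from lineage alone.
```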


Code Comparison

Below is a simple example to count words in a text file using Hadoop MapReduce.

java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: emits a (word, 1) pair for every token in the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      String[] tokens = value.toString().split("\\s+");
      for (String token : tokens) {
        if (token.isEmpty()) continue; // leading whitespace can produce an empty token
        word.set(token);
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sums the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Output (tab-separated, one word per line)

word1   3
word2   5
word3   2
...

Spark Equivalent

Here is the equivalent word count example using Spark with Python (PySpark).

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('WordCount').getOrCreate()
sc = spark.sparkContext

text_file = sc.textFile('input.txt')
counts = text_file.flatMap(lambda line: line.split()) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)

# collect() pulls every result to the driver; use it only for small outputs
for word, count in counts.collect():
    print(f'{word}\t{count}')

spark.stop()
Output

word1   3
word2   5
word3   2
...
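Before running the PySpark job on a cluster, the expected counts can be sanity-checked locally with Python's standard library, since collections.Counter performs the same flat-map-then-count logic on a single machine. The sample lines below are a stand-in for the contents of input.txt.

```python
from collections import Counter

# Stand-in for the contents of input.txt
lines = ["spark is fast", "spark scales well"]

# Equivalent of flatMap(split) -> map((word, 1)) -> reduceByKey(add)
counts = Counter(word for line in lines for word in line.split())
print(counts)  # Counter({'spark': 2, 'is': 1, 'fast': 1, 'scales': 1, 'well': 1})
```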

When to Use Which

Choose Hadoop when you have very large datasets that require reliable batch processing and you can tolerate slower speeds, especially if your data is already stored in HDFS.

Choose Spark when you need faster processing, real-time analytics, or iterative machine learning tasks, and when you want simpler programming with multiple language support.

In many modern big data projects, Spark is preferred for its speed and flexibility, but Hadoop remains useful for heavy-duty batch jobs and storage.

Key Takeaways

- Spark processes data in memory, making it much faster than Hadoop's disk-based MapReduce.
- Hadoop suits large-scale batch jobs and provides fault tolerance via data replication.
- Spark offers simpler APIs and supports real-time and iterative processing.
- Use Hadoop when data volume is huge and batch latency is acceptable.
- Use Spark for speed, real-time analytics, and machine learning workloads.