Hadoop vs Spark: Key Differences and When to Use Each
Hadoop is a framework for distributed storage and batch processing built on MapReduce, while Spark is a fast, in-memory data processing engine that supports both batch and real-time analytics. Spark is generally faster and easier to use than Hadoop MapReduce, but Hadoop excels at large-scale storage through its HDFS file system.
Quick Comparison
Here is a quick side-by-side comparison of Hadoop and Spark on key factors.
| Factor | Hadoop | Spark |
|---|---|---|
| Processing Model | Batch processing with MapReduce | In-memory batch and stream processing |
| Speed | Slower due to disk I/O | Faster due to in-memory computation |
| Ease of Use | Complex, requires Java coding | Simpler APIs in Scala, Python, Java, R |
| Storage | Uses HDFS for distributed storage | Can use HDFS or other storage systems |
| Fault Tolerance | High, via data replication in HDFS | High, via RDD lineage and data replication |
| Use Cases | Large-scale batch jobs, ETL | Real-time analytics, iterative algorithms |
Key Differences
Hadoop uses the MapReduce programming model that writes intermediate data to disk, making it slower but reliable for batch jobs. It relies on HDFS for distributed storage, which replicates data across nodes for fault tolerance.
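To make the MapReduce model concrete, here is a minimal plain-Python sketch of its three phases (map, shuffle/group, reduce) applied to word counting. It illustrates only the dataflow, not Hadoop's distributed, disk-backed execution.

```python
from collections import defaultdict

def map_phase(lines):
    # Emit (word, 1) pairs, analogous to a Mapper
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Group values by key; Hadoop performs this between map and reduce,
    # writing intermediate data to disk
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts per word, analogous to a Reducer
    return {key: sum(values) for key, values in groups.items()}

lines = ["to be or not to be"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Each phase here runs in one process; in Hadoop, map and reduce tasks run on different nodes, which is why the shuffle step (and its disk I/O) dominates job latency.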
Spark improves speed by keeping data in memory during processing, which is ideal for iterative tasks like machine learning and real-time analytics. It supports multiple languages and provides higher-level APIs, making it easier to write complex workflows.
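The lineage idea behind Spark's fault tolerance can be sketched in a few lines of plain Python: each dataset remembers its parent and the transformation that produced it, so a lost partition can be recomputed from the source rather than restored from a replica. This is a toy illustration of the concept, not Spark's actual RDD implementation.

```python
class MiniRDD:
    """Toy RDD: records lineage (parent + transformation), not results."""
    def __init__(self, data=None, parent=None, transform=None):
        self.data = data          # only the source RDD holds data
        self.parent = parent
        self.transform = transform

    def map(self, fn):
        return MiniRDD(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return MiniRDD(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # Recompute by replaying the lineage chain from the source;
        # Spark does the same per lost partition after a failure
        if self.parent is None:
            return self.data
        return self.transform(self.parent.collect())

source = MiniRDD(data=[1, 2, 3, 4, 5])
result = source.map(lambda x: x * x).filter(lambda x: x > 5)
print(result.collect())  # [9, 16, 25]
```

Because transformations are recorded rather than executed eagerly, Spark can also defer work until an action like `collect()` is called, which is the basis of its lazy evaluation.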
While Hadoop is great for storing massive datasets and running long batch jobs, Spark is preferred when speed and ease of development are priorities, especially for streaming data and interactive queries.
Code Comparison
Here is a simple example of counting words in a text file using Hadoop MapReduce in Java.
```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
Spark Equivalent
Here is the equivalent word count example using Apache Spark with Python (PySpark).
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('WordCount').getOrCreate()
sc = spark.sparkContext

text_file = sc.textFile('input.txt')
counts = text_file.flatMap(lambda line: line.split()) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)

for word, count in counts.collect():
    print(f'{word}\t{count}')

spark.stop()
```
When to Use Which
Choose Hadoop when you need reliable, large-scale storage with batch processing of massive datasets and your jobs are not time-sensitive. Hadoop's HDFS is excellent for storing huge amounts of data across many machines.
Choose Spark when you want faster processing, especially for iterative algorithms, real-time data streams, or interactive data analysis. Spark's in-memory computing and easy-to-use APIs speed up development and execution.
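Spark's classic streaming model treats a live stream as a sequence of small batches and applies the same batch logic to each one. The micro-batch idea can be sketched in plain Python (an illustration only; Spark's DStream and Structured Streaming machinery handle scheduling, state, and fault tolerance for you):

```python
from collections import Counter

def micro_batches(stream, batch_size):
    # Discretize an incoming stream into small batches, the way
    # Spark Streaming discretizes input into micro-batches
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

running = Counter()  # stateful aggregation carried across batches
stream = ["a", "b", "a", "c", "b", "a", "d"]
for batch in micro_batches(stream, batch_size=3):
    running.update(batch)  # same word-count logic, applied per batch
print(running.most_common(2))  # [('a', 3), ('b', 2)]
```

Because each micro-batch reuses the ordinary batch code path, the same word-count logic from the PySpark example above works on streaming input with little modification.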