Hadoop vs Spark: Key Differences and When to Use Each
The main difference between Hadoop and Spark is that Hadoop uses disk-based storage for processing data in batches, while Spark processes data in-memory for faster performance. Spark is better for real-time and iterative tasks, whereas Hadoop is suited for large-scale batch processing.
Quick Comparison
Here is a quick side-by-side comparison of Hadoop and Spark on key factors.
| Factor | Hadoop | Spark |
|---|---|---|
| Processing Model | Batch processing using MapReduce | In-memory processing with DAG engine |
| Speed | Slower due to disk I/O | Faster due to in-memory computation |
| Ease of Use | Complex MapReduce programming | Simpler APIs in Java, Scala, Python |
| Data Handling | Processes data stored on HDFS | Processes data from HDFS and other sources |
| Real-time Processing | Not suitable | Supports real-time and streaming |
| Fault Tolerance | High, via data replication | High, via lineage and RDDs |
Key Differences
Hadoop is a framework that stores data on a distributed file system called HDFS and processes it using MapReduce, which reads and writes data to disk between each step. This makes it reliable but slower, especially for tasks that need multiple passes over data.
Spark, on the other hand, keeps data in memory as Resilient Distributed Datasets (RDDs) during processing. This reduces disk I/O and speeds up computations, making it ideal for iterative algorithms and real-time analytics.
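The cost of materializing intermediate results can be sketched with a toy iterative job in plain Python (a conceptual illustration only, not the Hadoop or Spark APIs; the function names and temp-file layout are hypothetical): the disk-based pipeline writes each pass's result to a file and reads it back, while the in-memory pipeline keeps the working set in a list.

```python
import json
import os
import tempfile

def iterate_on_disk(data, passes):
    """Each pass writes its result to disk and the next pass reads it back,
    mimicking how MapReduce materializes intermediate results between steps."""
    path = os.path.join(tempfile.mkdtemp(), 'intermediate.json')
    io_ops = 0
    for _ in range(passes):
        data = [x + 1 for x in data]      # the "map" step of this pass
        with open(path, 'w') as f:        # write intermediate result to disk
            json.dump(data, f)
        io_ops += 1
        with open(path) as f:             # next pass reads it back
            data = json.load(f)
        io_ops += 1
    return data, io_ops

def iterate_in_memory(data, passes):
    """Keeps the working set in memory between passes, as Spark does with
    cached RDDs, so no intermediate disk round trips are needed."""
    for _ in range(passes):
        data = [x + 1 for x in data]
    return data, 0

disk_result, disk_io = iterate_on_disk([1, 2, 3], passes=3)
mem_result, mem_io = iterate_in_memory([1, 2, 3], passes=3)
print(disk_result, disk_io)   # [4, 5, 6] with 6 disk round trips
print(mem_result, mem_io)     # [4, 5, 6] with 0 disk round trips
```

Both pipelines compute the same answer; the disk version simply pays two I/O operations per pass, which is exactly the overhead that grows painful for iterative algorithms.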
While Hadoop requires writing complex MapReduce code, Spark offers easy-to-use APIs in several languages and supports SQL, machine learning, and graph processing libraries. Both handle fault tolerance differently: Hadoop replicates data blocks, while Spark rebuilds lost data using lineage information.
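The lineage idea can be illustrated with a minimal sketch in plain Python (a conceptual toy, not Spark's actual API or class names): each dataset remembers its parent and the transformation that produced it, so a lost partition can be recomputed on demand rather than restored from a replica.

```python
class MiniRDD:
    """Toy dataset that records its lineage: a parent dataset plus the
    transformation that produced it (conceptual only, not Spark's API)."""

    def __init__(self, data=None, parent=None, transform=None):
        self._data = data          # None means not materialized (or lost)
        self._parent = parent
        self._transform = transform

    def map(self, fn):
        # Record the transformation instead of eagerly storing a result
        return MiniRDD(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def lose(self):
        # Simulate losing the cached partition (e.g., an executor crashes)
        self._data = None

    def collect(self):
        # If the data is gone, rebuild it by replaying the lineage chain
        if self._data is None:
            self._data = self._transform(self._parent.collect())
        return self._data

base = MiniRDD(data=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
print(doubled.collect())   # [2, 4, 6], computed via the recorded lineage
doubled.lose()
print(doubled.collect())   # [2, 4, 6] again, recomputed after the "loss"
```

Hadoop's approach, by contrast, would keep multiple physical copies of the data blocks, trading extra storage for the ability to read a replica directly instead of recomputing.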
Code Comparison
Below is a simple example to count words in a text file using Hadoop MapReduce.
```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: splits each input line on whitespace and emits (word, 1) pairs
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] tokens = value.toString().split("\\s+");
            for (String token : tokens) {
                word.set(token);
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // combiner pre-aggregates on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
Spark Equivalent
Here is the equivalent word count example using Spark with Python (PySpark).
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('WordCount').getOrCreate()
sc = spark.sparkContext

text_file = sc.textFile('input.txt')

# Split lines into words, pair each word with 1, then sum the counts per word
counts = text_file.flatMap(lambda line: line.split()) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)

for word, count in counts.collect():
    print(f'{word}\t{count}')

spark.stop()
```
When to Use Which
Choose Hadoop when you have very large datasets that require reliable batch processing and you can tolerate slower speeds, especially if your data is already stored in HDFS.
Choose Spark when you need faster processing, real-time analytics, or iterative machine learning tasks, and when you want simpler programming with multiple language support.
In many modern big data projects, Spark is preferred for its speed and flexibility, but Hadoop remains useful for heavy-duty batch jobs and storage.