What is Combiner in MapReduce in Hadoop: Explanation and Example
A combiner is a mini-reducer that runs after the map phase to perform local aggregation of data before sending it to the reducer. It helps reduce the amount of data transferred across the network, improving job efficiency.
How It Works
Think of the combiner as a helper that cleans up and summarizes data right after the map step, but before the reduce step. Imagine you have many people counting words in different rooms of a building. Instead of sending all their raw counts to a central office, each room first sums up its own counts. This local summary is what the combiner does.
In MapReduce, after the map tasks produce key-value pairs, the combiner takes these pairs and combines values with the same key locally on the mapper node. This reduces the volume of data sent over the network to the reducers, which speeds up the overall process.
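To make the local aggregation concrete, here is a small sketch (plain Java, outside the Hadoop API; the class name and sample words are illustrative) that simulates what a combiner does with one mapper's (word, 1) pairs:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simulates a combiner's local aggregation on a single mapper's output.
public class CombinerEffectDemo {
    public static void main(String[] args) {
        // Raw mapper output: one (word, 1) pair per occurrence.
        List<String> mapperOutput = List.of("the", "cat", "the", "dog", "the", "cat");

        // The combiner sums values with the same key locally, before the shuffle.
        Map<String, Integer> combined = new HashMap<>();
        for (String word : mapperOutput) {
            combined.merge(word, 1, Integer::sum);
        }

        // Six raw pairs shrink to three combined pairs sent over the network.
        System.out.println("Pairs before combiner: " + mapperOutput.size()); // 6
        System.out.println("Pairs after combiner:  " + combined.size());    // 3
        System.out.println(combined);
    }
}
```

The savings grow with repetition: a mapper that sees the word "the" a thousand times sends a single ("the", 1000) pair instead of a thousand ("the", 1) pairs.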
However, the combiner is optional and only works correctly if the reduce function is commutative and associative, meaning the order of combining does not change the result.
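Averaging is the classic counterexample. This sketch (plain Java, illustrative names and data) shows why a combiner that averages locally would corrupt the result when mappers see different amounts of data:

```java
import java.util.List;

// Why averaging breaks with a naive combiner: the average of partial
// averages is not the overall average when partitions differ in size.
public class AverageCombinerPitfall {
    static double avg(List<Integer> xs) {
        return xs.stream().mapToInt(Integer::intValue).average().orElse(0);
    }

    public static void main(String[] args) {
        List<Integer> mapper1 = List.of(1, 2, 3, 4); // local average = 2.5
        List<Integer> mapper2 = List.of(10);         // local average = 10.0

        double correct = avg(List.of(1, 2, 3, 4, 10));    // 4.0
        double naive = (avg(mapper1) + avg(mapper2)) / 2; // 6.25 -- wrong!

        System.out.println("Correct average: " + correct);
        System.out.println("Average of local averages: " + naive);
    }
}
```

The standard workaround is to have the combiner emit partial (sum, count) pairs, which are associative and commutative, and compute the final average in the reducer.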
Example
This example shows a simple word count MapReduce job with a combiner that sums word counts locally before sending them to the reducer.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCountWithCombiner {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit a (word, 1) pair for every token in the input line.
            String[] tokens = value.toString().split("\\s+");
            for (String token : tokens) {
                word.set(token);
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all counts for this word. Because summing is commutative
            // and associative, the same class can also serve as the combiner.
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count with combiner");
        job.setJarByClass(WordCountWithCombiner.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // Setting the combiner
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
When to Use
Use a combiner when your reduce operation is commutative and associative, such as summing, counting, or finding minimum/maximum values. It helps reduce network traffic by summarizing data early.
For example, in word count jobs, a combiner can sum word counts locally on each mapper node before sending results to the reducer. This is especially useful when the input data is large and the intermediate data is huge.
However, do not use a combiner if it can change the final result or if the reduce function is not suitable for partial aggregation.
Key Points
- A combiner is an optional mini-reducer that runs after the map phase.
- It performs local aggregation to reduce data sent to reducers.
- It improves performance by lowering network traffic.
- Only use it when the reduce function is commutative and associative.
- It is not guaranteed to run; Hadoop may skip it or invoke it multiple times, so the job must produce correct results either way.