Word Count in MapReduce in Hadoop: Explanation and Example
Word count in MapReduce in Hadoop is a simple program that counts how many times each word appears in a large set of text data. It uses the Map step to split text into words and the Reduce step to sum the counts for each word, enabling distributed processing of big data.
How It Works
Imagine you have a huge book and want to know how many times each word appears. Doing this alone would take a long time. MapReduce in Hadoop solves this by splitting the book into many small parts and giving each part to different helpers (computers).
First, the Map step breaks each part into words and outputs each word with a count of one. Then, the Reduce step gathers all counts for the same word from all helpers and adds them up to get the total count.
This way, the work is shared and done faster, even if the text is very large. Hadoop manages this process automatically across many machines.
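The flow above can be sketched in plain Java, without Hadoop, as a simplified simulation: the map function emits a (word, 1) pair for every token in a split, and the reduce function sums the pairs by word. The class name WordCountSketch and the two sample splits are illustrative choices, not part of any Hadoop API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {

    // Map step: emit a (word, 1) pair for every token in one input split.
    static List<SimpleEntry<String, Integer>> map(String split) {
        List<SimpleEntry<String, Integer>> pairs = new ArrayList<>();
        for (String token : split.split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Reduce step: gather all pairs and sum the counts for each word.
    static Map<String, Integer> reduce(List<SimpleEntry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (SimpleEntry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Two "splits" standing in for the parts Hadoop hands to different machines.
        List<SimpleEntry<String, Integer>> emitted = new ArrayList<>();
        emitted.addAll(map("the cat sat"));
        emitted.addAll(map("the dog sat"));
        System.out.println(reduce(emitted)); // {cat=1, dog=1, sat=2, the=2}
    }
}
```

In real Hadoop, the framework (not this single JVM) runs many map tasks in parallel, shuffles the pairs so all counts for a word reach the same reducer, and runs the reduce tasks; this sketch only shows the logical transformation.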
Example
This example shows a simple MapReduce program in Java that counts words in Hadoop.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: split each line of text into words and emit (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce step: sum all counts emitted for the same word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // The reducer also serves as a combiner, pre-summing counts on each mapper node.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // args[0] is the input directory, args[1] the (not yet existing) output directory.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
When to Use
Use word count in MapReduce when you need to analyze large text data that is too big for one computer. It helps find the frequency of words in documents like books, logs, or social media posts.
This method is useful in search engines, text mining, and data analysis where counting words quickly and accurately across huge datasets is important.
Key Points
- Word count is a classic example to learn MapReduce basics.
- Map step splits text into words and assigns count 1 to each.
- Reduce step sums counts for each word from all mappers.
- Hadoop distributes this process across many machines automatically.
- It works well for big data text analysis tasks.