Hadoop · Concept · Beginner · 4 min read

Word Count in MapReduce in Hadoop: Explanation and Example

Word count is the classic introductory MapReduce program in Hadoop: it counts how many times each word appears in a large set of text data. The Map step splits the text into words, and the Reduce step sums the counts for each word, enabling distributed processing of big data.

How It Works

Imagine you have a huge book and want to know how many times each word appears. Doing this alone would take a long time. MapReduce in Hadoop solves this by splitting the book into many small parts and giving each part to different helpers (computers).

First, the Map step breaks each part into words and outputs each word with a count of one. Then, the Reduce step gathers all counts for the same word from all helpers and adds them up to get the total count.

This way, the work is shared and done faster, even if the text is very large. Hadoop manages this process automatically across many machines.
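The map, shuffle, and reduce phases described above can be sketched in plain Java on a single machine. This is only an illustration of the data flow, not Hadoop itself: the class and method names (`WordCountSketch`, `wordCount`) are made up for this example, and real Hadoop distributes each phase across many machines.

```java
import java.util.*;

public class WordCountSketch {

    public static Map<String, Integer> wordCount(String text) {
        // "Map" phase: emit a (word, 1) pair for every word in the text
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }

        // "Shuffle" phase: group all counts that belong to the same word
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>())
                   .add(p.getValue());
        }

        // "Reduce" phase: sum the grouped counts to get each word's total
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int c : e.getValue()) sum += c;
            totals.put(e.getKey(), sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        System.out.println(wordCount("big data big hadoop"));
        // prints {big=2, data=1, hadoop=2}
    }
}
```

In Hadoop the shuffle step happens automatically between the Mapper and Reducer, so the real program in the next section only has to write the map and reduce logic.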


Example

This example shows a simple MapReduce program in Java that counts words in Hadoop.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Output (Hadoop sorts the output by key, one word per line):

example 1
hadoop 2
mapreduce 2
word 3
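Note that the job above also registers IntSumReducer as a combiner (`job.setCombinerClass`). This works because addition is associative: each mapper can pre-sum its own counts for a word before the shuffle, and the reducer then sums those partial sums, reducing network traffic without changing the result. A minimal sketch of that idea (the class name `CombinerSketch` is made up for this illustration):

```java
import java.util.List;

public class CombinerSketch {

    // Sum a list of counts -- the same logic IntSumReducer applies
    public static int sum(List<Integer> counts) {
        int total = 0;
        for (int c : counts) total += c;
        return total;
    }

    public static void main(String[] args) {
        // Counts for one word, emitted by two different mappers
        List<Integer> mapper1 = List.of(1, 1, 1);
        List<Integer> mapper2 = List.of(1, 1);

        // Without a combiner: the reducer sums all five raw ones
        int direct = sum(List.of(1, 1, 1, 1, 1));

        // With a combiner: each mapper pre-sums, the reducer sums partials
        int combined = sum(List.of(sum(mapper1), sum(mapper2)));

        System.out.println(direct + " " + combined); // prints "5 5"
    }
}
```

Not every reducer can double as a combiner; this only works when the reduce operation, like summing, gives the same answer whether it is applied once or in stages.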

When to Use

Use word count in MapReduce when you need to analyze large text data that is too big for one computer. It helps find the frequency of words in documents like books, logs, or social media posts.

This method is useful in search engines, text mining, and data analysis where counting words quickly and accurately across huge datasets is important.

Key Points

  • Word count is a classic example to learn MapReduce basics.
  • Map step splits text into words and assigns count 1 to each.
  • Reduce step sums counts for each word from all mappers.
  • Hadoop runs this process distributed across many machines.
  • It works well for big data text analysis tasks.

Key Takeaways

  • Word count in MapReduce counts how often each word appears in large text data using distributed computing.
  • The Map step breaks text into words and outputs each with count one.
  • The Reduce step sums all counts for each word to get total frequency.
  • Hadoop manages running MapReduce jobs across many machines for big data.
  • This technique is useful for text analysis in search engines, logs, and social media.