Hadoop · How-To · Beginner · 4 min read

How MapReduce Works in Hadoop: Process and Example

In Hadoop, MapReduce processes large datasets by splitting them into chunks, mapping each chunk to key-value pairs, and then reducing those pairs to summarize the results. The Map step filters and transforms input records into intermediate key-value pairs; the framework then sorts and groups those pairs by key before the Reduce step aggregates them into the final output.

Syntax

The basic structure of a MapReduce program in Hadoop includes two main functions: map() and reduce(). The map() function processes input data and outputs intermediate key-value pairs. The framework groups those pairs by key, and the reduce() function receives each key with its list of values and processes them to produce the final output.

  • map(key, value): Processes input and emits intermediate key-value pairs.
  • reduce(key, list_of_values): Aggregates values for each key and emits final results.
```java
public class WordCount {
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```

Example

This example counts the number of times each word appears in a text file using Hadoop MapReduce. The map() function splits lines into words and emits each word with a count of 1. The reduce() function sums all counts for each word to get the total occurrences.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
Output

word1	3
word2	5
word3	2

Common Pitfalls

  • Incorrect key-value types: Mismatched types between the mapper, combiner, and reducer cause runtime errors; if the map output types differ from the final output types, set them explicitly with setMapOutputKeyClass and setMapOutputValueClass.
  • Not handling input splits properly: Hadoop splits input automatically, but unsplittable inputs (such as gzip-compressed files) prevent parallel processing of a file.
  • Forgetting to set output key/value classes: Hadoop requires the output types to be set explicitly on the Job.
  • Not using a Combiner: Without local pre-aggregation, every intermediate pair is shuffled across the network, which is wasteful for aggregations like counting.
```java
/* Wrong: missing output key/value class settings */
job.setMapperClass(TokenizerMapper.class);
// Missing job.setOutputKeyClass and job.setOutputValueClass

/* Right: set output key/value classes explicitly */
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
```
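The combiner pitfall can be illustrated without Hadoop. A combiner runs the reduce logic locally on each mapper's output, so far fewer records cross the network in the shuffle. The sketch below uses plain Java with illustrative class and method names (not part of the Hadoop API) to compare record counts with and without local combining:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: a combiner pre-aggregates one mapper's output
// locally, shrinking what must be shuffled across the network.
public class CombinerDemo {

    // One mapper's raw output: a (word, 1) pair per occurrence.
    static List<Map.Entry<String, Integer>> mapSplit(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.split("\\s+")) {
            pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Combiner: sum counts per word locally, before the shuffle.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> combined = new HashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        String split = "to be or not to be";
        List<Map.Entry<String, Integer>> raw = mapSplit(split);
        Map<String, Integer> combined = combine(raw);
        // Without a combiner 6 records would be shuffled; with one, only 4.
        System.out.println(raw.size() + " -> " + combined.size()); // 6 -> 4
    }
}
```

Because Hadoop may run the combiner zero, one, or many times, its logic must be associative and commutative, with the same input and output types as the reducer; summation qualifies, but averaging does not.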

Quick Reference

MapReduce Workflow Steps:

  • Input Split: Data is split into chunks.
  • Map: Processes each chunk to produce key-value pairs.
  • Shuffle and Sort: Groups key-value pairs by key.
  • Reduce: Aggregates values for each key.
  • Output: Writes final results to storage.
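The steps above can be sketched in plain Java with no Hadoop dependencies; the class and method names here are illustrative, not part of the Hadoop API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hadoop-free sketch of the MapReduce workflow:
// split -> map -> shuffle/sort -> reduce -> output.
public class MiniMapReduce {

    // Map: emit a (word, 1) pair for every word in a split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle and sort: group intermediate values by key
    // (TreeMap keeps the keys sorted, like Hadoop's sort phase).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: sum the list of values for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int v : entry.getValue()) {
                sum += v;
            }
            result.put(entry.getKey(), sum);
        }
        return result;
    }

    // Run the whole pipeline; each input string stands in for one input split.
    static Map<String, Integer> run(List<String> splits) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String split : splits) {
            pairs.addAll(map(split));
        }
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = run(List.of("hello world", "hello hadoop"));
        System.out.println(counts); // {hadoop=1, hello=2, world=1}
    }
}
```

In real Hadoop, each step runs in parallel across machines and the shuffle moves data over the network; this sketch only shows the data flow on one machine.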

Key Takeaways

  • MapReduce splits data, maps it to key-value pairs, then reduces to aggregate results.
  • The map() function processes input data; reduce() aggregates mapped data by key.
  • Always set the correct output key and value classes in your Hadoop job configuration.
  • Use combiners to cut down data transfer between the map and reduce phases.
  • MapReduce handles large data through parallel processing and sorting before reduction.