MapReduce helps process big data by breaking tasks into small parts and running them on many computers at once.
MapReduce job execution flow in Hadoop
Introduction
MapReduce is a good fit when:
You have a huge amount of data to analyze, such as logs from a website.
You want to count words in many documents quickly.
You need to sort or filter large datasets stored across many machines.
You want to run data processing jobs that can be split into independent tasks.
You want a simple programming model for complex data processing.
Syntax
A MapReduce job runs in three main steps:
1. Map phase: processes input data and emits key-value pairs.
2. Shuffle and Sort phase: groups the pairs by key.
3. Reduce phase: processes each group of values to produce the final results.
The Map phase runs many tasks in parallel on chunks of data.
The Shuffle phase moves data between machines to group keys together.
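The three steps above can be sketched as a small local simulation in plain Python (this is only an illustration of the data flow, not the Hadoop runtime; the function names are made up):

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    # Map phase: each input record yields (key, value) pairs.
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    # Shuffle and Sort phase: group values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: combine each key's values into a final result.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

# Word count expressed as a mapper and a reducer:
mapper = lambda line: [(word.lower(), 1) for word in line.split()]
reducer = lambda word, counts: sum(counts)

print(run_mapreduce(["to be or", "not to be"], mapper, reducer))
# {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

In real Hadoop the map and reduce calls run on different machines and the shuffle moves data over the network, but the logical contract is the same.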
Examples
This example counts how many times each word appears in a text.
Map phase: read lines from a file and output (word, 1) for each word.
Shuffle phase: group all (word, 1) pairs by word.
Reduce phase: sum the counts for each word to get its total occurrences.
This example summarizes user activity from log data.
Map phase: extract user IDs and actions from the logs.
Shuffle phase: group actions by user ID.
Reduce phase: summarize the actions for each user.
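The same three phases can be walked through in plain Python (a local sketch; the log format, user IDs, and action names are invented for illustration):

```python
from collections import defaultdict

log_lines = [
    "u1 login",
    "u2 view_page",
    "u1 purchase",
    "u1 logout",
    "u2 login",
]

# Map phase: extract (user_id, action) from each log line.
pairs = [tuple(line.split()) for line in log_lines]

# Shuffle phase: group actions by user ID.
by_user = defaultdict(list)
for user_id, action in pairs:
    by_user[user_id].append(action)

# Reduce phase: summarize actions per user (here, count them).
summary = {user: len(actions) for user, actions in by_user.items()}
print(summary)  # {'u1': 3, 'u2': 2}
```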
Sample Program
This Python program uses the MRJob library to count words in its input. The mapper splits each line into words and emits each word with a count of 1; the reducer sums the counts for each word.
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word on the line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum all counts for this word.
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()
Important Notes
The MapReduce framework handles splitting data and running tasks in parallel.
Shuffle and sort happen automatically between map and reduce phases.
Reducers receive all values for a key together to process them.
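The grouping guarantee in the last note comes from the shuffle and sort step. Its effect can be mimicked with a sort followed by itertools.groupby (a sketch of what the framework does between phases, not Hadoop's actual implementation):

```python
from itertools import groupby
from operator import itemgetter

# Intermediate pairs as emitted by several map tasks, in arbitrary order.
pairs = [("be", 1), ("to", 1), ("be", 1), ("or", 1), ("to", 1)]

# Sort by key, then group: each reducer call sees all values for one key.
pairs.sort(key=itemgetter(0))
totals = {}
for word, group in groupby(pairs, key=itemgetter(0)):
    totals[word] = sum(value for _, value in group)

print(totals)  # {'be': 2, 'or': 1, 'to': 2}
```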
Summary
MapReduce breaks big data jobs into map and reduce tasks.
Map tasks process input and create key-value pairs.
Reduce tasks aggregate results by key to produce final output.