MapReduce helps process big data by breaking tasks into small parts and running them on many computers at once.
MapReduce job execution flow in Hadoop
Introduction
MapReduce is a good fit when:
You have a huge amount of data to analyze, such as logs from a website.
You want to count words in many documents quickly.
You need to sort or filter large datasets stored across many machines.
You want to run data processing jobs that can be split into independent tasks.
You want a simple programming model for complex data processing.
Syntax
A MapReduce job runs in three main steps:
1. Map phase: processes input data and emits key-value pairs.
2. Shuffle and Sort phase: groups the pairs by key.
3. Reduce phase: processes each group of values to produce the final results.
The Map phase runs many tasks in parallel on chunks of data.
The Shuffle phase moves data between machines to group keys together.
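The three steps above can be sketched as a small local simulation in plain Python (this is only an illustration of the data flow, not the Hadoop runtime; the function names are made up):

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    # Map phase: each input record yields (key, value) pairs.
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    # Shuffle and Sort phase: group values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: combine each key's values into a final result.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

# Word count expressed as a mapper and a reducer:
mapper = lambda line: [(word.lower(), 1) for word in line.split()]
reducer = lambda word, counts: sum(counts)

print(run_mapreduce(["to be or", "not to be"], mapper, reducer))
# {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

In real Hadoop the map and reduce calls run on different machines and the shuffle moves data over the network, but the logical contract is the same.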
Examples
This example counts how many times each word appears in a text.
Map phase: read lines from a file and output (word, 1) for each word.
Shuffle phase: group all (word, 1) pairs by word.
Reduce phase: sum the counts for each word to get its total occurrences.
This example summarizes user activity from log data.
Map phase: extract user IDs and actions from the logs.
Shuffle phase: group actions by user ID.
Reduce phase: summarize the actions for each user.
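The same three phases can be walked through in plain Python (a local sketch; the log format, user IDs, and action names are invented for illustration):

```python
from collections import defaultdict

log_lines = [
    "u1 login",
    "u2 view_page",
    "u1 purchase",
    "u1 logout",
    "u2 login",
]

# Map phase: extract (user_id, action) from each log line.
pairs = [tuple(line.split()) for line in log_lines]

# Shuffle phase: group actions by user ID.
by_user = defaultdict(list)
for user_id, action in pairs:
    by_user[user_id].append(action)

# Reduce phase: summarize actions per user (here, count them).
summary = {user: len(actions) for user, actions in by_user.items()}
print(summary)  # {'u1': 3, 'u2': 2}
```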
Sample Program
This Python program uses the MRJob library to count words in its input. The mapper splits each line into words and emits each word with a count of 1; the reducer sums the counts for each word.
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word on the line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum all counts for this word.
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()
Important Notes
The MapReduce framework handles splitting data and running tasks in parallel.
Shuffle and sort happen automatically between map and reduce phases.
Reducers receive all values for a key together to process them.
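The grouping guarantee in the last note comes from the shuffle and sort step. Its effect can be mimicked with a sort followed by itertools.groupby (a sketch of what the framework does between phases, not Hadoop's actual implementation):

```python
from itertools import groupby
from operator import itemgetter

# Intermediate pairs as emitted by several map tasks, in arbitrary order.
pairs = [("be", 1), ("to", 1), ("be", 1), ("or", 1), ("to", 1)]

# Sort by key, then group: each reducer call sees all values for one key.
pairs.sort(key=itemgetter(0))
totals = {}
for word, group in groupby(pairs, key=itemgetter(0)):
    totals[word] = sum(value for _, value in group)

print(totals)  # {'be': 2, 'or': 1, 'to': 2}
```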
Summary
MapReduce breaks big data jobs into map and reduce tasks.
Map tasks process input and create key-value pairs.
Reduce tasks aggregate results by key to produce final output.