Hadoop · ~30 mins

Map phase explained in Hadoop - Mini Project: Build & Apply

Map Phase Explained in Hadoop MapReduce
📖 Scenario: Imagine you work at a company that wants to count how many times each word appears in a large collection of documents. Hadoop MapReduce helps process this big data by breaking the job into smaller parts. The first part is called the Map phase, where each document is split into words and each word is paired with the number 1.
🎯 Goal: You will create a simple Python program that simulates the Map phase of Hadoop MapReduce. You will start by setting up the data, then configure a helper variable, apply the map logic to split text into words and pair each with 1, and finally print the mapped output.
📋 What You'll Learn
Create a dictionary called documents with three entries: 'doc1', 'doc2', and 'doc3', each containing a short sentence.
Create a variable called separator and set it to a space character.
Use a for loop with variables doc_id and text to iterate over documents.items().
Inside the loop, split the text using separator and create a list of tuples called mapped_words where each tuple is (word, 1).
Print the doc_id and the mapped_words list for each document.
💡 Why This Matters
🌍 Real World
The Map phase is the first step in processing large text data in Hadoop. It breaks down big files into smaller pieces and prepares data for counting or analysis.
💼 Career
Understanding the Map phase helps you work with big data tools like Hadoop and Spark, which are widely used in data engineering and data science jobs.
1
Set up the documents dictionary
Create a dictionary called documents with these exact entries: 'doc1': 'hello world', 'doc2': 'hello hadoop', and 'doc3': 'map reduce example'.
Need a hint?

Use curly braces {} to create a dictionary. Each key is a document ID string, and each value is a sentence string.
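A minimal sketch of this step, using the exact keys and sentences the instructions specify, might look like:

```python
# Step 1: set up the input data as a dictionary of document IDs to sentences.
documents = {
    'doc1': 'hello world',
    'doc2': 'hello hadoop',
    'doc3': 'map reduce example',
}
```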

2
Create the separator variable
Create a variable called separator and set it to a single space character ' '.
Need a hint?

The separator is the character used to split sentences into words. Use a space ' '.
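This step is a single assignment; one way to write it:

```python
# Step 2: the character used to split each sentence into words.
separator = ' '
```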

3
Apply the map logic to split and pair words
Use a for loop with variables doc_id and text to iterate over documents.items(). Inside the loop, split the text using separator and create a list called mapped_words where each element is a tuple (word, 1).
Need a hint?

Use text.split(separator) to get words. Then use a list comprehension to pair each word with 1.
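Putting the hint together, a sketch of the map logic (repeating the setup from steps 1 and 2 so it runs on its own) could be:

```python
# Setup from earlier steps.
documents = {
    'doc1': 'hello world',
    'doc2': 'hello hadoop',
    'doc3': 'map reduce example',
}
separator = ' '

# Step 3: for each document, split the text into words and
# pair every word with the count 1, just like Hadoop's Map phase.
for doc_id, text in documents.items():
    mapped_words = [(word, 1) for word in text.split(separator)]
```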

4
Print the mapped output for each document
Inside the for loop, print the doc_id and the mapped_words list for each document.
Need a hint?

Use print(doc_id, mapped_words) inside the loop to show the output.
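Combining all four steps, the complete program might look like this:

```python
# Steps 1-2: input data and separator.
documents = {
    'doc1': 'hello world',
    'doc2': 'hello hadoop',
    'doc3': 'map reduce example',
}
separator = ' '

# Steps 3-4: map each document to (word, 1) pairs and print the result.
for doc_id, text in documents.items():
    mapped_words = [(word, 1) for word in text.split(separator)]
    print(doc_id, mapped_words)
```

Running it prints one line per document:

```
doc1 [('hello', 1), ('world', 1)]
doc2 [('hello', 1), ('hadoop', 1)]
doc3 [('map', 1), ('reduce', 1), ('example', 1)]
```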