Map Phase Explained in Hadoop MapReduce
📖 Scenario: Imagine you work at a company that wants to count how many times each word appears in a large collection of documents. Hadoop MapReduce helps process this big data by breaking the job into smaller parts. The first part is called the Map phase, where each document is split into words and each word is paired with the number 1.
🎯 Goal: You will create a simple Python program that simulates the Map phase of Hadoop MapReduce. You will start by setting up the data, then configure a helper variable, apply the map logic to split text into words and pair each with 1, and finally print the mapped output.
📋 What You'll Learn
1. Create a dictionary called documents with three entries: 'doc1', 'doc2', and 'doc3', each containing a short sentence.
2. Create a variable called separator and set it to a space character.
3. Use a for loop with variables doc_id and text to iterate over documents.items().
4. Inside the loop, split the text using separator and create a list of tuples called mapped_words where each tuple is (word, 1).
5. Print the doc_id and the mapped_words list for each document.
💡 Why This Matters
🌍 Real World
The Map phase is the first step in processing large text data in Hadoop. It breaks down big files into smaller pieces and prepares data for counting or analysis.
💼 Career
Understanding the Map phase helps you work with big data tools like Hadoop and Spark, which are widely used in data engineering and data science jobs.
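The steps above could come together as the following sketch. The three sentences used for doc1, doc2, and doc3 are illustrative placeholders; any short sentences would work:

```python
# Step 1: set up the input data as a dictionary of documents.
documents = {
    'doc1': 'hadoop processes big data',
    'doc2': 'the map phase splits text',
    'doc3': 'big data needs hadoop',
}

# Step 2: helper variable — words are separated by a single space.
separator = ' '

# Steps 3-5: iterate over the documents, split each text into words,
# pair every word with the count 1, and print the mapped output.
for doc_id, text in documents.items():
    mapped_words = [(word, 1) for word in text.split(separator)]
    print(doc_id, mapped_words)
```

For doc1 this prints pairs like ('hadoop', 1) and ('data', 1), which is exactly the intermediate (key, value) form that the later Shuffle and Reduce phases would group and sum.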