MapReduce job execution flow in Hadoop - Mini Project: Build & Apply

MapReduce Job Execution Flow

📖 Scenario: You are working with a large dataset of sales records stored in Hadoop. You want to understand how a MapReduce job processes this data step-by-step to calculate total sales per product.

🎯 Goal: Simulate the key steps of a MapReduce job in plain Python using dictionaries: set up the input data, configure a threshold, filter the sales data against that threshold, and output the products whose total sales exceed it.

📋 What You'll Learn

Create a dictionary with sales data for products and their sales amounts

Add a sales threshold variable to filter products

Use a loop to keep only the products whose sales exceed the threshold

Print the final dictionary of products with total sales above the threshold

💡 Why This Matters

🌍 Real World

MapReduce jobs process large datasets by splitting tasks into map and reduce phases. This project simulates how data flows and is filtered in such jobs.
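The map and reduce phases mentioned above can be sketched in plain Python. This is only an illustration of the idea, not how Hadoop itself is implemented: the `records` list and the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are hypothetical, invented for this example.

```python
from collections import defaultdict

# Hypothetical raw sales records as (product, amount) pairs.
records = [("apple", 70), ("banana", 80), ("apple", 50), ("orange", 150)]

def map_phase(records):
    """Map: emit one (key, value) pair per input record."""
    for product, amount in records:
        yield product, amount

def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}

totals = reduce_phase(shuffle_phase(map_phase(records)))
print(totals)  # {'apple': 120, 'banana': 80, 'orange': 150}
```

The `sales_data` dictionary used in the steps below plays the role of `totals` here: it is assumed to already hold one total per product.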

💼 Career

Understanding MapReduce execution flow is essential for data engineers and data scientists working with big data platforms like Hadoop.

DATA SETUP: Create sales data dictionary

Create a dictionary called sales_data with these exact entries: 'apple': 120, 'banana': 80, 'orange': 150, 'grape': 60, 'mango': 200.

Python

# Create the sales_data dictionary with product names and sales amounts
# Your code here

Need a hint?

Use curly braces to create a dictionary with product names as keys and sales amounts as values.

CONFIGURATION: Set sales threshold

Create a variable called sales_threshold and set it to 100.

Python

sales_data = {'apple': 120, 'banana': 80, 'orange': 150, 'grape': 60, 'mango': 200}
# Create sales_threshold variable and set it to 100
# Your code here

Need a hint?

Just assign the number 100 to the variable sales_threshold.

CORE LOGIC: Filter sales above threshold

Create an empty dictionary called filtered_sales. Use a for loop with variables product and amount to iterate over sales_data.items(). Inside the loop, add the product and amount to filtered_sales only if amount is greater than sales_threshold.

Python

sales_data = {'apple': 120, 'banana': 80, 'orange': 150, 'grape': 60, 'mango': 200}
sales_threshold = 100
# Create filtered_sales dictionary and add products with sales above sales_threshold
# Your code here

Need a hint?

Use a for loop to check each product's sales and add to filtered_sales if above threshold.
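For comparison, the same filter can be written as a dictionary comprehension. The step asks for an explicit for loop, so treat this as an alternative sketch rather than the expected answer; it produces the same dictionary.

```python
sales_data = {'apple': 120, 'banana': 80, 'orange': 150, 'grape': 60, 'mango': 200}
sales_threshold = 100

# Keep only the entries whose amount is strictly greater than the threshold.
filtered_sales = {
    product: amount
    for product, amount in sales_data.items()
    if amount > sales_threshold
}
```

The comprehension is more compact, while the loop form makes each step of the filter easier to follow when you are learning.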

OUTPUT: Print filtered sales dictionary

Write print(filtered_sales) to display the dictionary of products with sales above the threshold.

Python

sales_data = {'apple': 120, 'banana': 80, 'orange': 150, 'grape': 60, 'mango': 200}
sales_threshold = 100
filtered_sales = {}
for product, amount in sales_data.items():
    if amount > sales_threshold:
        filtered_sales[product] = amount
# Print the filtered_sales dictionary
# Your code here

Need a hint?

Use print() to show the filtered_sales dictionary.
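Putting the four steps together, a complete run might look like the sketch below. The helper name `filter_sales` and the `'kiwi'` entry are hypothetical, added only to show how the strict `>` comparison treats a value exactly equal to the threshold.

```python
def filter_sales(sales, threshold):
    """Return the entries of `sales` whose amount is strictly greater than `threshold`."""
    filtered = {}
    for product, amount in sales.items():
        if amount > threshold:
            filtered[product] = amount
    return filtered

sales_data = {'apple': 120, 'banana': 80, 'orange': 150, 'grape': 60, 'mango': 200}
sales_threshold = 100

filtered_sales = filter_sales(sales_data, sales_threshold)
print(filtered_sales)  # {'apple': 120, 'orange': 150, 'mango': 200}

# A product whose sales equal the threshold is excluded, because > is strict.
print(filter_sales({'kiwi': 100}, sales_threshold))  # {}
```

If products at exactly the threshold should be kept, the comparison would need to be `>=` instead.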