0
0
Hadoopdata~30 mins

MapReduce job execution flow in Hadoop - Mini Project: Build & Apply

Choose your learning style9 modes available
MapReduce Job Execution Flow
📖 Scenario: You are working with a large dataset of sales records stored in Hadoop. You want to understand how a MapReduce job processes this data step-by-step to calculate total sales per product.
🎯 Goal: Build a simple MapReduce job execution flow using Python dictionaries and lists to simulate the key steps: input data setup, configuration of a threshold, mapping sales data, and outputting total sales per product.
📋 What You'll Learn
Create a dictionary with sales data for products and their sales amounts
Add a sales threshold variable to filter products
Use a loop to sum sales per product only if sales exceed the threshold
Print the final dictionary of products with total sales above the threshold
💡 Why This Matters
🌍 Real World
MapReduce jobs process large datasets by splitting tasks into map and reduce phases. This project simulates how data flows and is filtered in such jobs.
💼 Career
Understanding MapReduce execution flow is essential for data engineers and data scientists working with big data platforms like Hadoop.
Progress0 / 4 steps
1
DATA SETUP: Create sales data dictionary
Create a dictionary called sales_data with these exact entries: 'apple': 120, 'banana': 80, 'orange': 150, 'grape': 60, 'mango': 200.
Hadoop
Need a hint?

Use curly braces to create a dictionary with product names as keys and sales amounts as values.

2
CONFIGURATION: Set sales threshold
Create a variable called sales_threshold and set it to 100.
Hadoop
Need a hint?

Just assign the number 100 to the variable sales_threshold.

3
CORE LOGIC: Filter and sum sales above threshold
Create an empty dictionary called filtered_sales. Use a for loop with variables product and amount to iterate over sales_data.items(). Inside the loop, add the product and amount to filtered_sales only if amount is greater than sales_threshold.
Hadoop
Need a hint?

Use a for loop to check each product's sales and add to filtered_sales if above threshold.

4
OUTPUT: Print filtered sales dictionary
Write print(filtered_sales) to display the dictionary of products with sales above the threshold.
Hadoop
Need a hint?

Use print() to show the filtered_sales dictionary.