
Reduce phase explained in Hadoop - Mini Project: Build & Apply

📖 Scenario: You are working with a big dataset of sales records. Each record has a product name and the number of units sold. You want to find the total units sold for each product.
🎯 Goal: Build a simple Hadoop Reduce phase that sums the units sold for each product.
📋 What You'll Learn
Create an input data structure with product names and units sold
Create a configuration variable for the minimum units threshold
Write the Reduce phase logic to sum units sold per product
Print the total units for each product that meets the threshold
💡 Why This Matters
🌍 Real World
In real big data jobs, the Reduce phase combines data from many sources to get totals or summaries, like total sales per product.
💼 Career
Understanding the Reduce phase is key for data engineers and data scientists working with Hadoop or similar big data tools.
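As a rough illustration of the idea above: in a Hadoop job, the framework sorts and groups the map output by key before each reduce call, so the reducer sees one key together with all of its values. The sketch below imitates that in plain Python. The names map_output and reduce_sum are made up for this example and are not part of any Hadoop API.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical map output: (product, units) pairs in arbitrary order.
map_output = [('banana', 5), ('apple', 10), ('orange', 3), ('apple', 7), ('banana', 8)]

# Stand-in for the shuffle-and-sort step: order pairs by key so equal keys sit together.
sorted_pairs = sorted(map_output, key=itemgetter(0))

def reduce_sum(key, values):
    # Illustrative reduce function: receives one key and all of its values.
    return key, sum(values)

# Call the reduce function once per distinct key.
totals = [
    reduce_sum(key, (units for _, units in group))
    for key, group in groupby(sorted_pairs, key=itemgetter(0))
]
print(totals)  # [('apple', 17), ('banana', 13), ('orange', 3)]
```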
1
DATA SETUP: Create the sales data list
Create a list called sales_data containing these exact records, in this order: ('apple', 10), ('banana', 5), ('apple', 7), ('orange', 3), ('banana', 8). Use a list of tuples rather than a dictionary, because a dictionary cannot hold repeated keys such as the two 'apple' records.
Need a hint?

Use a list of tuples like [('apple', 10), ('banana', 5), ...]
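One possible way to write this step, shown as a minimal sketch. A list of tuples keeps both 'apple' records and both 'banana' records, whereas a dictionary literal with repeated keys would silently keep only the last value for each key.

```python
# Each tuple is one sales record: (product name, units sold).
sales_data = [
    ('apple', 10),
    ('banana', 5),
    ('apple', 7),
    ('orange', 3),
    ('banana', 8),
]

print(len(sales_data))  # 5
```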

2
CONFIGURATION: Set minimum units threshold
Create a variable called min_units and set it to 5. It will be used later to filter out products whose total units sold is below this value.
Need a hint?

Just write min_units = 5

3
CORE LOGIC: Sum units sold per product in Reduce phase
Create an empty dictionary called reduced_data. Use a for loop with variables product and units to iterate over sales_data. Add the units sold to reduced_data[product], initializing to 0 if the product is not yet in the dictionary.
Need a hint?

Use a dictionary to sum units. Check if product key exists before adding.
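A minimal sketch of this step, assuming sales_data is the list of tuples from step 1 (repeated here so the snippet runs on its own):

```python
sales_data = [('apple', 10), ('banana', 5), ('apple', 7), ('orange', 3), ('banana', 8)]

reduced_data = {}
for product, units in sales_data:
    # Start the running total at 0 the first time a product appears.
    if product not in reduced_data:
        reduced_data[product] = 0
    reduced_data[product] += units

print(reduced_data)  # {'apple': 17, 'banana': 13, 'orange': 3}
```

The explicit membership check follows the hint; reduced_data.get(product, 0) + units would give the same totals in one line.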

4
OUTPUT: Print products with total units at or above the threshold
Use a for loop with variables product and total_units to iterate over reduced_data.items(). Inside the loop, use an if statement to check whether total_units is greater than or equal to min_units. If it is, print the product and its total in the format product: total_units.
Need a hint?

Print only products with total units 5 or more.
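A minimal sketch of this step, assuming reduced_data holds the totals produced in step 3 and min_units is the threshold from step 2 (both repeated here so the snippet runs on its own):

```python
reduced_data = {'apple': 17, 'banana': 13, 'orange': 3}
min_units = 5

for product, total_units in reduced_data.items():
    # Keep only products whose total meets the threshold.
    if total_units >= min_units:
        print(f"{product}: {total_units}")

# Output:
# apple: 17
# banana: 13
```

With these values, orange is left out because its total of 3 is below min_units.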