Hadoop | data | ~30 mins

GROUP and JOIN operations in Hadoop - Mini Project: Build & Apply

📖 Scenario: You work at a small online store. You have two data files: one with customer orders and another with customer details. You want to learn how to group orders by customer and then join customer details with their orders.
🎯 Goal: Build a small Python program that mirrors the GROUP and JOIN stages of a Hadoop MapReduce job: group orders by customer ID, then join customer details with their orders to produce a combined output showing each customer's name and their orders.
📋 What You'll Learn
Create a dataset of orders with customer IDs and order amounts
Create a dataset of customers with customer IDs and names
Write a MapReduce job to group orders by customer ID
Write a MapReduce job to join customer details with their orders
Print the final joined output showing customer names and their orders
💡 Why This Matters
🌍 Real World
Grouping and joining data is common in sales analysis, customer segmentation, and reporting in many businesses.
💼 Career
Data scientists and engineers often need to group and join large datasets to prepare data for analysis or machine learning.
1. Create the orders dataset
Create a list called orders with these exact entries: ("C1", 100), ("C2", 150), ("C1", 200), ("C3", 300).
Hint: Use a Python list of tuples to store the orders exactly as shown.
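
One possible way to write this step is sketched below. It uses only the variable name and values given in the instructions; nothing else is assumed.

```python
# Each tuple is (customer ID, order amount), in the order given in the task.
orders = [("C1", 100), ("C2", 150), ("C1", 200), ("C3", 300)]
```

Customer C1 appears twice, which is what makes the grouping step later in the project worthwhile.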

2. Create the customers dataset
Create a dictionary called customers with these exact entries: 'C1': 'Alice', 'C2': 'Bob', 'C3': 'Charlie'.
Hint: Use a Python dictionary to map customer IDs to names exactly as shown.
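
A minimal sketch of this step, using only the name and entries given in the instructions:

```python
# Maps each customer ID to the customer's name; this is the lookup side of the join.
customers = {"C1": "Alice", "C2": "Bob", "C3": "Charlie"}
```

Storing the customers as a dictionary keyed by ID means each lookup during the join is a single key access.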

3. Group orders by customer ID
Create a dictionary called grouped_orders that groups order amounts by customer ID from the orders list. Use a for loop with variables cust_id and amount to iterate over orders. Append amounts to the list for each customer ID.
Hint: Use a dictionary to collect lists of amounts for each customer ID.
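
The grouping described in this step could look roughly like the sketch below. The orders list is repeated so the snippet runs on its own. The explicit membership check is just one way to create the per-customer list; it is a choice made for this sketch, not something the exercise requires.

```python
# Orders from step 1, repeated so this snippet is self-contained.
orders = [("C1", 100), ("C2", 150), ("C1", 200), ("C3", 300)]

grouped_orders = {}
for cust_id, amount in orders:
    # Start an empty list the first time a customer ID is seen.
    if cust_id not in grouped_orders:
        grouped_orders[cust_id] = []
    # Append this order's amount to that customer's list.
    grouped_orders[cust_id].append(amount)
```

This plays the role that the shuffle phase has in MapReduce: all values that share a key end up collected together under that key.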

4. Join customers with their grouped orders and print
Create a list called joined_data that contains tuples of customer name and their list of order amounts by joining customers and grouped_orders. Use a for loop with variable cust_id to iterate over grouped_orders. Then print joined_data.
Hint: Use a loop to create a list of tuples with customer names and their orders, then print it.
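
A sketch of the join step follows. The two input dictionaries are written out literally (with the values the earlier steps produce) so the snippet runs on its own. It assumes every customer ID in grouped_orders also exists in customers, which holds for this dataset.

```python
# Inputs from the earlier steps, repeated so this snippet is self-contained.
customers = {"C1": "Alice", "C2": "Bob", "C3": "Charlie"}
grouped_orders = {"C1": [100, 200], "C2": [150], "C3": [300]}

joined_data = []
for cust_id in grouped_orders:
    # Look up the name by customer ID and pair it with that customer's order amounts.
    joined_data.append((customers[cust_id], grouped_orders[cust_id]))

print(joined_data)
# Prints: [('Alice', [100, 200]), ('Bob', [150]), ('Charlie', [300])]
```

If an order could reference an unknown customer, customers[cust_id] would raise a KeyError; customers.get(cust_id) with a default is one way to handle that case.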