Hadoop · data · ~30 mins

Hadoop in the cloud (EMR, Dataproc, HDInsight) - Mini Project: Build & Apply

Analyzing Sales Data Using Hadoop in the Cloud (EMR, Dataproc, HDInsight)
📖 Scenario: You work for a retail company that wants to analyze its sales data stored in the cloud. The data is large, so you will use a managed Hadoop service such as AWS EMR, Google Dataproc, or Azure HDInsight to process it efficiently.
This project guides you through setting up a simple sales dataset, configuring a threshold for high sales, processing the data with MapReduce-style logic, and finally outputting the filtered results.
🎯 Goal: Build a simple Hadoop-style data processing pipeline in Python that mimics how cloud Hadoop services process big data. You will create sales data, set a sales threshold, filter sales above the threshold, and print the results.
📋 What You'll Learn
Create a dictionary with sales data for products and their sales numbers
Add a sales threshold variable to filter high sales
Use a for loop to filter products with sales above the threshold
Print the filtered high sales products and their sales
💡 Why This Matters
🌍 Real World
Retail companies use cloud Hadoop services like EMR, Dataproc, or HDInsight to process large volumes of sales data quickly and spot important trends.
💼 Career
Data analysts and engineers use these cloud tools to handle big data and extract useful insights for business decisions.
1
Create the sales data dictionary
Create a dictionary called sales_data with these exact entries: 'Laptop': 120, 'Smartphone': 250, 'Tablet': 90, 'Headphones': 150, 'Smartwatch': 80.
Need a hint?

Use curly braces {} to create a dictionary with product names as keys and sales numbers as values.
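
As a minimal sketch of this step, the dictionary can be written as a literal. The values are the ones listed in the step; the comments are only illustrative.

```python
# Sales data: product name (key) -> units sold (value).
sales_data = {
    'Laptop': 120,
    'Smartphone': 250,
    'Tablet': 90,
    'Headphones': 150,
    'Smartwatch': 80,
}
```

In a real cluster job these records would typically be read from cloud storage; here an in-memory dictionary stands in for that input.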

2
Set the sales threshold
Create a variable called sales_threshold and set it to 100. Later steps use it to keep only the products whose sales are above this number.
Need a hint?

Just assign the number 100 to the variable sales_threshold.
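
A one-line sketch of this step is shown here. Keeping the threshold in a named variable, rather than writing 100 directly in the filter, makes it easy to change later.

```python
# Products must sell MORE than this many units to count as high sales.
sales_threshold = 100
```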

3
Filter products with sales above the threshold
Create an empty dictionary called high_sales. Use a for loop with variables product and sales to iterate over sales_data.items(). Inside the loop, add products with sales greater than sales_threshold to high_sales.
Need a hint?

Use for product, sales in sales_data.items(): to loop through the dictionary. Inside the loop, use an if statement to check whether sales is greater than sales_threshold.
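
The filtering step could look like the sketch below. The data and threshold from steps 1 and 2 are repeated so that the snippet runs on its own.

```python
# Repeated from steps 1 and 2 so this snippet is self-contained.
sales_data = {
    'Laptop': 120,
    'Smartphone': 250,
    'Tablet': 90,
    'Headphones': 150,
    'Smartwatch': 80,
}
sales_threshold = 100

# Collect only the products whose sales exceed the threshold.
high_sales = {}
for product, sales in sales_data.items():
    if sales > sales_threshold:  # strictly greater than, not >=
        high_sales[product] = sales
```

The check uses a strict comparison, so a product with exactly 100 sales would not be included.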

4
Print the filtered high sales products
Write a print statement to display the high_sales dictionary.
Need a hint?

Use print(high_sales) to show the filtered results.
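
Putting the four steps together, the whole exercise can be sketched as one short script. It runs locally in plain Python and only imitates the filter logic of a Hadoop job; it does not connect to EMR, Dataproc, or HDInsight.

```python
# Step 1: sales data (product -> units sold).
sales_data = {
    'Laptop': 120,
    'Smartphone': 250,
    'Tablet': 90,
    'Headphones': 150,
    'Smartwatch': 80,
}

# Step 2: threshold for "high" sales.
sales_threshold = 100

# Step 3: keep products whose sales are above the threshold.
high_sales = {}
for product, sales in sales_data.items():
    if sales > sales_threshold:
        high_sales[product] = sales

# Step 4: display the filtered results.
print(high_sales)
# → {'Laptop': 120, 'Smartphone': 250, 'Headphones': 150}
```

Tablet (90) and Smartwatch (80) fall below the threshold, so they are left out of the printed dictionary.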