Apache Spark · data · ~30 mins

Google Dataproc overview in Apache Spark - Mini Project: Build & Apply

Google Dataproc Overview with Apache Spark
📖 Scenario: You work as a data analyst at a company that uses Google Cloud. Your team wants to analyze sales data with Apache Spark on Google Dataproc, a managed cloud service that makes big data processing easier and faster. Dataproc lets you create clusters of virtual machines to run Spark jobs without managing the infrastructure yourself.
🎯 Goal: Build a simple Python program that simulates loading sales data, sets a configurable sales threshold, filters for sales above that threshold, and prints the filtered results. This mimics how you would use Spark on Dataproc to process data efficiently.
📋 What You'll Learn
Create a dictionary with sales data for 5 products and their sales numbers
Create a variable for the sales threshold
Use a dictionary comprehension to filter products with sales above the threshold
Print the filtered dictionary
💡 Why This Matters
🌍 Real World
Google Dataproc helps companies run big data jobs on the cloud easily. Filtering sales data is a common task to find important insights quickly.
💼 Career
Data analysts and data engineers use Dataproc and Spark to process large datasets efficiently without managing servers.
1
Create the sales data dictionary
Create a dictionary called sales_data with these exact entries: 'Apples': 150, 'Bananas': 90, 'Cherries': 120, 'Dates': 60, 'Elderberries': 30
Need a hint?

Use curly braces {} to create a dictionary with product names as keys and sales numbers as values.
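Following the hint, a minimal sketch of this step:

```python
# Sales data from the exercise: product name -> units sold
sales_data = {
    'Apples': 150,
    'Bananas': 90,
    'Cherries': 120,
    'Dates': 60,
    'Elderberries': 30,
}
```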

2
Set the sales threshold
Create a variable called sales_threshold and set it to 100
Need a hint?

Just assign the number 100 to the variable sales_threshold.
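As the hint says, this step is a single assignment:

```python
# Only products selling more than this count will be kept later
sales_threshold = 100
```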

3
Filter sales above the threshold
Use a dictionary comprehension to create a new dictionary called filtered_sales that includes only products from sales_data with sales greater than sales_threshold
Need a hint?

Use {product: sales for product, sales in sales_data.items() if sales > sales_threshold} to filter the dictionary.
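Combining the hint's comprehension with the data from the earlier steps, a sketch of the filter:

```python
sales_data = {'Apples': 150, 'Bananas': 90, 'Cherries': 120,
              'Dates': 60, 'Elderberries': 30}
sales_threshold = 100

# Keep only the items whose sales exceed the threshold
filtered_sales = {product: sales
                  for product, sales in sales_data.items()
                  if sales > sales_threshold}
# filtered_sales -> {'Apples': 150, 'Cherries': 120}
```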

4
Print the filtered sales
Print the filtered_sales dictionary
Need a hint?

Use print(filtered_sales) to show the filtered dictionary.
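Putting all four steps together, the complete program looks like this:

```python
# Step 1: sales data as a dictionary
sales_data = {'Apples': 150, 'Bananas': 90, 'Cherries': 120,
              'Dates': 60, 'Elderberries': 30}

# Step 2: the filtering threshold
sales_threshold = 100

# Step 3: dictionary comprehension keeps only sales above the threshold
filtered_sales = {product: sales for product, sales in sales_data.items()
                  if sales > sales_threshold}

# Step 4: show the result
print(filtered_sales)  # {'Apples': 150, 'Cherries': 120}
```

Only Apples (150) and Cherries (120) exceed the threshold of 100, so they are the only entries printed.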