Apache Spark · data · ~30 mins

Google Dataproc overview in Apache Spark - Mini Project: Build & Apply

Google Dataproc Overview with Apache Spark
📖 Scenario: You work as a data analyst at a company that uses Google Cloud. Your team wants to analyze sales data with Apache Spark on Google Dataproc, a managed cloud service that makes big data processing easier and faster. Dataproc lets you create clusters of virtual machines to run Spark jobs without managing the infrastructure yourself.
🎯 Goal: Build a simple Python program that simulates loading sales data, sets a configurable sales threshold, filters for sales above that threshold, and prints the filtered results. This mimics how you would use Spark on Dataproc to process data efficiently.
📋 What You'll Learn
Create a dictionary with sales data for 5 products and their sales numbers
Create a variable for the sales threshold
Use a dictionary comprehension to filter products with sales above the threshold
Print the filtered dictionary
💡 Why This Matters
🌍 Real World
Google Dataproc helps companies run big data jobs on the cloud easily. Filtering sales data is a common task to find important insights quickly.
💼 Career
Data analysts and data engineers use Dataproc and Spark to process large datasets efficiently without managing servers.
1
Create the sales data dictionary
Create a dictionary called sales_data with these exact entries: 'Apples': 150, 'Bananas': 90, 'Cherries': 120, 'Dates': 60, 'Elderberries': 30
Need a hint?

Use curly braces {} to create a dictionary with product names as keys and sales numbers as values.
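Following the hint, a minimal sketch of this step:

```python
# Sales data from the exercise: product name -> units sold
sales_data = {
    'Apples': 150,
    'Bananas': 90,
    'Cherries': 120,
    'Dates': 60,
    'Elderberries': 30,
}
```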

2
Set the sales threshold
Create a variable called sales_threshold and set it to 100
Need a hint?

Just assign the number 100 to the variable sales_threshold.
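As the hint says, this step is a single assignment:

```python
# Only products selling more than this count will be kept later
sales_threshold = 100
```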

3
Filter sales above the threshold
Use a dictionary comprehension to create a new dictionary called filtered_sales that includes only products from sales_data with sales greater than sales_threshold
Need a hint?

Use {product: sales for product, sales in sales_data.items() if sales > sales_threshold} to filter the dictionary.
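Combining the hint's comprehension with the data from the earlier steps, a sketch of the filter:

```python
sales_data = {'Apples': 150, 'Bananas': 90, 'Cherries': 120,
              'Dates': 60, 'Elderberries': 30}
sales_threshold = 100

# Keep only the items whose sales exceed the threshold
filtered_sales = {product: sales
                  for product, sales in sales_data.items()
                  if sales > sales_threshold}
# filtered_sales -> {'Apples': 150, 'Cherries': 120}
```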

4
Print the filtered sales
Print the filtered_sales dictionary
Need a hint?

Use print(filtered_sales) to show the filtered dictionary.
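Putting all four steps together, the complete program looks like this:

```python
# Step 1: sales data as a dictionary
sales_data = {'Apples': 150, 'Bananas': 90, 'Cherries': 120,
              'Dates': 60, 'Elderberries': 30}

# Step 2: the filtering threshold
sales_threshold = 100

# Step 3: dictionary comprehension keeps only sales above the threshold
filtered_sales = {product: sales for product, sales in sales_data.items()
                  if sales > sales_threshold}

# Step 4: show the result
print(filtered_sales)  # {'Apples': 150, 'Cherries': 120}
```

Only Apples (150) and Cherries (120) exceed the threshold of 100, so they are the only entries printed.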