Apache Spark · ~30 mins

Reduce and aggregate actions in Apache Spark - Mini Project: Build & Apply

Reduce and aggregate actions
📖 Scenario: You work at a small online store. You have a list of sales records showing the product name and the quantity sold. You want to find out the total quantity sold for each product.
🎯 Goal: Build a Spark program that sums the quantities sold for each product using reduce and aggregate actions.
📋 What You'll Learn
Create an RDD with the exact sales data given
Create a variable for the minimum quantity threshold
Use reduceByKey to sum quantities for each product
Print the final aggregated result
💡 Why This Matters
🌍 Real World
Summing sales quantities per product is a common task in retail analytics to understand product performance.
💼 Career
Data scientists and analysts often use reduce and aggregate actions in Spark to process large datasets efficiently.
1
Create the sales data RDD
Create a Spark RDD called sales_rdd from the list of tuples: [('apple', 10), ('banana', 5), ('apple', 3), ('banana', 7), ('orange', 8)].
Apache Spark
Need a hint?

Use sc.parallelize() to create an RDD from the list of tuples.

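The hint above can be sketched as follows. It assumes a SparkContext named `sc`, which the pyspark shell creates automatically; the Spark call is left as a comment so the snippet also runs without pyspark installed.

```python
# Step 1 sketch: the raw sales data as (product, quantity) tuples.
sales_data = [('apple', 10), ('banana', 5), ('apple', 3), ('banana', 7), ('orange', 8)]

# In a Spark session, parallelize() distributes this list across partitions:
# sales_rdd = sc.parallelize(sales_data)
print(sales_data)
```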
2
Set the minimum quantity threshold
Create a variable called min_quantity and set it to 5.
Apache Spark
Need a hint?

Just assign the number 5 to the variable min_quantity.

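Step 2 in code. The remaining steps do not use the threshold in the sum itself; the commented `filter` line shows where such a threshold could plug in, but that filter is an illustration, not part of this project.

```python
# Step 2 sketch: the minimum quantity threshold.
min_quantity = 5

# Hypothetical later use, once total_sales_rdd exists (not required here):
# big_sellers = total_sales_rdd.filter(lambda kv: kv[1] >= min_quantity)
print(min_quantity)
```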
3
Sum quantities for each product using reduceByKey
Use reduceByKey on sales_rdd with a lambda function that adds two quantities to create a new RDD called total_sales_rdd.
Apache Spark
Need a hint?

Use reduceByKey(lambda x, y: x + y) to add quantities for each product.

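In Spark this whole step is one line, `total_sales_rdd = sales_rdd.reduceByKey(lambda x, y: x + y)`. As a sanity check of what that line computes, here is the same semantics sketched locally in plain Python (no cluster needed): group the values by key, then fold each group with the lambda.

```python
from functools import reduce
from collections import defaultdict

sales_data = [('apple', 10), ('banana', 5), ('apple', 3), ('banana', 7), ('orange', 8)]
add = lambda x, y: x + y  # the same lambda passed to reduceByKey

# Group quantities by product...
groups = defaultdict(list)
for product, qty in sales_data:
    groups[product].append(qty)

# ...then fold each group with the lambda, as reduceByKey does per key.
total_sales = {product: reduce(add, qtys) for product, qtys in groups.items()}
print(total_sales)  # {'apple': 13, 'banana': 12, 'orange': 8}
```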
4
Print the total sales result
Print the list of tuples from total_sales_rdd.collect() to show the total quantity sold for each product.
Apache Spark
Need a hint?

Use print(total_sales_rdd.collect()) to display the final sums.
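Putting the four steps together, here is a sketch of the complete script. It assumes pyspark is installed; when it is not, an equivalent plain-dict aggregation stands in so the expected totals are still visible. The `local[*]` master and the app name are illustrative choices, not part of the exercise.

```python
# Full mini-project sketch: sum quantities per product.
sales_data = [('apple', 10), ('banana', 5), ('apple', 3), ('banana', 7), ('orange', 8)]
min_quantity = 5  # step 2 threshold (not applied to the sums themselves)

try:
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "sales-totals")      # local run; adjust for a cluster
    sales_rdd = sc.parallelize(sales_data)             # step 1
    total_sales_rdd = sales_rdd.reduceByKey(lambda x, y: x + y)  # step 3
    totals = total_sales_rdd.collect()                 # step 4
    sc.stop()
except ImportError:
    # Fallback without pyspark: the same per-product sums with a dict.
    sums = {}
    for product, qty in sales_data:
        sums[product] = sums.get(product, 0) + qty
    totals = list(sums.items())

print(totals)  # pair order may vary across Spark partitions
```

Note that `collect()` brings the aggregated pairs back to the driver as a plain list, so their order is not guaranteed; sort before comparing against an expected result.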