Apache Spark, data, ~30 mins

GroupBy and aggregations in Apache Spark - Mini Project: Build & Apply

GroupBy and aggregations
📖 Scenario: You work at a small online store. You have a list of sales records with product names and quantities sold. You want to find out how many units of each product were sold in total.
🎯 Goal: Build a Spark DataFrame from sales data, then group the data by product name and calculate the total quantity sold for each product.
📋 What You'll Learn
Create a Spark DataFrame with columns product and quantity using the exact data provided.
Create a variable called grouped_data that groups the DataFrame by product.
Use the agg method with a sum aggregation on the quantity column.
Rename the aggregated column to total_quantity.
Print the resulting DataFrame using show().
💡 Why This Matters
🌍 Real World
Grouping and aggregating data is common in sales analysis, inventory management, and reporting to understand totals and summaries.
💼 Career
Data analysts and data scientists use groupBy and aggregation to summarize large datasets and extract meaningful insights for business decisions.
1
Create the sales DataFrame
Create a Spark DataFrame called sales_df with columns product and quantity using this exact data: [('apple', 10), ('banana', 5), ('apple', 7), ('banana', 3), ('orange', 8)].
Need a hint?

Use spark.createDataFrame() with the list of tuples and specify the column names as ['product', 'quantity'].

2
Group the data by product
Create a variable called grouped_data that groups sales_df by the product column using the groupBy method.
Need a hint?

Use sales_df.groupBy('product') to group the data by product.

3
Aggregate the total quantity sold
Use the agg method on grouped_data to calculate the sum of the quantity column. Rename the aggregated column to total_quantity and assign the result back to grouped_data, which then holds a DataFrame rather than a GroupedData object.
Need a hint?

Use agg(sum('quantity').alias('total_quantity')) on grouped_data, where sum is imported from pyspark.sql.functions (Python's built-in sum does not work here).

4
Display the aggregated results
Print the grouped_data DataFrame using the show() method to display the total quantity sold for each product.
Need a hint?

Use grouped_data.show() to display the results.
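To check what show() should report, the same totals can be worked out without Spark. The following is a plain-Python cross-check of the arithmetic, not part of the Spark solution; the variable names are illustrative.

```python
from collections import defaultdict

# The exact data from the exercise: (product, quantity) pairs.
sales = [("apple", 10), ("banana", 5), ("apple", 7), ("banana", 3), ("orange", 8)]

# Accumulate the quantity per product, mirroring groupBy('product') followed by a sum.
expected_totals = defaultdict(int)
for product, quantity in sales:
    expected_totals[product] += quantity

# Print in alphabetical order for a stable listing.
for product in sorted(expected_totals):
    print(product, expected_totals[product])
```

The table printed by show() should contain these three rows (apple 17, banana 8, orange 8), though Spark does not guarantee their order unless the DataFrame is sorted first, for example with orderBy('product').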