Apache Spark · ~30 mins

Windowed aggregations in Apache Spark - Mini Project: Build & Apply

Windowed aggregations
📖 Scenario: You work for a retail company that tracks daily sales of products. You want to analyze sales trends by calculating the total sales for each product over a moving 3-day window.
🎯 Goal: Build a Spark program that uses windowed aggregations to calculate the 3-day rolling sum of sales for each product.
📋 What You'll Learn
Create a Spark DataFrame with sales data for products over several days
Define a window specification partitioned by product and ordered by date
Use a window function to calculate the rolling 3-day sum of sales
Display the final DataFrame with the rolling sums
💡 Why This Matters
🌍 Real World
Windowed aggregations help businesses analyze trends over time, like moving averages or rolling sums, which are useful for sales forecasting and inventory management.
💼 Career
Data scientists and analysts use window functions in Spark to efficiently compute time-based metrics on large datasets, a common task in many data-driven roles.
1
Create the sales DataFrame
Create a Spark DataFrame called sales_df with these exact rows: ("2024-01-01", "A", 10), ("2024-01-02", "A", 20), ("2024-01-03", "A", 30), ("2024-01-01", "B", 5), ("2024-01-02", "B", 15), ("2024-01-03", "B", 25). The columns must be named date (string), product (string), and sales (integer).
Need a hint?

Use spark.createDataFrame() with a list of tuples and specify the column names as a list.

2
Define the window specification
Create a window specification called window_spec that partitions data by product and orders by date. The window frame should include the current row and the two previous rows (3-day window). Use Window.partitionBy("product").orderBy("date").rowsBetween(-2, 0).
Need a hint?

Import Window from pyspark.sql and chain partitionBy, orderBy, and rowsBetween.

3
Calculate the rolling 3-day sum
Create a new DataFrame called result_df by adding a column named rolling_sum to sales_df. Use the sum function over the window_spec on the sales column.
Need a hint?

Use withColumn on sales_df and apply sum("sales").over(window_spec).

4
Display the rolling sums
Display the contents of result_df by calling result_df.show(), which prints the DataFrame to the console.
Need a hint?

Call result_df.show() to print the DataFrame to the console.