Apache Spark · ~30 mins

Windowed aggregations in Apache Spark - Mini Project: Build & Apply

Windowed aggregations
📖 Scenario: You work for a retail company that tracks daily sales of products. You want to analyze sales trends by calculating the total sales for each product over a moving 3-day window.
🎯 Goal: Build a Spark program that uses windowed aggregations to calculate the 3-day rolling sum of sales for each product.
📋 What You'll Learn
Create a Spark DataFrame with sales data for products over several days
Define a window specification partitioned by product and ordered by date
Use a window function to calculate the rolling 3-day sum of sales
Display the final DataFrame with the rolling sums
💡 Why This Matters
🌍 Real World
Windowed aggregations help businesses analyze trends over time, like moving averages or rolling sums, which are useful for sales forecasting and inventory management.
💼 Career
Data scientists and analysts use window functions in Spark to efficiently compute time-based metrics on large datasets, a common task in many data-driven roles.
1
Create the sales DataFrame
Create a Spark DataFrame called sales_df with these exact rows: ("2024-01-01", "A", 10), ("2024-01-02", "A", 20), ("2024-01-03", "A", 30), ("2024-01-01", "B", 5), ("2024-01-02", "B", 15), ("2024-01-03", "B", 25). The columns must be named date (string), product (string), and sales (integer).
Need a hint?

Use spark.createDataFrame() with a list of tuples and specify the column names as a list.

2
Define the window specification
Create a window specification called window_spec that partitions data by product and orders by date. The window frame should include the current row and the two previous rows (3-day window). Use Window.partitionBy("product").orderBy("date").rowsBetween(-2, 0).
Need a hint?

Import Window from pyspark.sql and chain partitionBy, orderBy, and rowsBetween.

3
Calculate the rolling 3-day sum
Create a new DataFrame called result_df by adding a column named rolling_sum to sales_df. Use the sum function over the window_spec on the sales column.
Need a hint?

Use withColumn on sales_df and apply sum("sales").over(window_spec).

4
Display the rolling sums
Display the contents of result_df by calling result_df.show(), which prints the DataFrame to the console.
Need a hint?

Call result_df.show() to print the DataFrame to the console.