Apache Spark · data · ~30 mins

Window functions in Apache Spark - Mini Project: Build & Apply

Analyzing Sales Data Using Window Functions in Apache Spark
📖 Scenario: You work for a retail company that wants to analyze monthly sales data for different stores. You have a dataset with sales amounts for each store by month. Your task is to calculate the running total of sales for each store over the months using window functions.
🎯 Goal: Build a Spark program that uses window functions to calculate the cumulative sales for each store by month.
📋 What You'll Learn
Create a Spark DataFrame with sales data for stores and months
Define a window specification partitioned by store and ordered by month
Apply an aggregate over the window to calculate cumulative sales
Display the final DataFrame with cumulative sales
💡 Why This Matters
🌍 Real World
Retail companies often analyze sales trends over time per store to make inventory and marketing decisions. Window functions help calculate running totals and rankings easily.
💼 Career
Data analysts and data scientists use window functions in Spark to perform advanced data analysis on large datasets efficiently.
1
Create the sales DataFrame
Create a Spark DataFrame called sales_df with these exact rows: ("StoreA", 1, 100), ("StoreA", 2, 150), ("StoreB", 1, 200), ("StoreB", 2, 300). The columns should be store, month, and sales.
💡 Hint: Use spark.createDataFrame() with a list of tuples and specify the column names.

2
Define the window specification
Create a window specification called window_spec that partitions data by store and orders by month in ascending order. Use Window.partitionBy("store").orderBy("month").
💡 Hint: Import Window from pyspark.sql and chain partitionBy and orderBy.

3
Calculate cumulative sales using window function
Add a new column called cumulative_sales to sales_df that contains the running total of sales for each store ordered by month. Use F.sum("sales").over(window_spec) and assign the result back to sales_df.
💡 Hint: Use withColumn and F.sum("sales").over(window_spec) to create the cumulative sales column.

4
Display the final DataFrame
Use sales_df.show() to display the DataFrame with the new cumulative_sales column.
💡 Hint: Call show() on sales_df to print the table.