Apache Spark · data · ~30 mins

Window functions in Apache Spark - Mini Project: Build & Apply

Analyzing Sales Data Using Window Functions in Apache Spark
📖 Scenario: You work for a retail company that wants to analyze monthly sales data for different stores. You have a dataset with sales amounts for each store by month. Your task is to calculate the running total of sales for each store over the months using window functions.
🎯 Goal: Build a Spark program that uses window functions to calculate the cumulative sales for each store by month.
📋 What You'll Learn
Create a Spark DataFrame with sales data for stores and months
Define a window specification partitioned by store and ordered by month
Apply an aggregate over the window to calculate cumulative sales
Display the final DataFrame with cumulative sales
💡 Why This Matters
🌍 Real World
Retail companies often analyze sales trends over time per store to make inventory and marketing decisions. Window functions help calculate running totals and rankings easily.
💼 Career
Data analysts and data scientists use window functions in Spark to perform advanced data analysis on large datasets efficiently.
1
Create the sales DataFrame
Create a Spark DataFrame called sales_df with these exact rows: ("StoreA", 1, 100), ("StoreA", 2, 150), ("StoreB", 1, 200), ("StoreB", 2, 300). The columns should be store, month, and sales.
💡 Hint: Use spark.createDataFrame() with a list of tuples and specify the column names.

2
Define the window specification
Create a window specification called window_spec that partitions data by store and orders by month in ascending order. Use Window.partitionBy("store").orderBy("month").
💡 Hint: Import Window from pyspark.sql and chain partitionBy and orderBy.

3
Calculate cumulative sales using window function
Add a new column called cumulative_sales to sales_df that contains the running total of sales for each store ordered by month. Use F.sum("sales").over(window_spec) and assign the result back to sales_df.
💡 Hint: Use withColumn and F.sum("sales").over(window_spec) to create the cumulative sales column.

4
Display the final DataFrame
Use sales_df.show() to display the DataFrame with the new cumulative_sales column.
💡 Hint: Call show() on sales_df to print the table.