
Window functions in Apache Spark

Introduction

Window functions let you perform calculations across rows related to the current row without collapsing them into groups. Unlike groupBy, they keep every row and add new columns of derived values.

Calculate running totals or moving averages over sales data.
Rank items within categories, such as the top products per region.
Compare each row to the previous or next row, such as daily temperature changes.
Add row numbers or dense ranks without losing detail rows.
Syntax
Apache Spark
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number  # rank, sum, avg, lag are used the same way

windowSpec = Window.partitionBy('category').orderBy('value')

# Example: add a row number within each category
result = df.withColumn('row_num', row_number().over(windowSpec))

partitionBy splits data into groups (like categories).

orderBy defines the order inside each group for calculations.
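To make those two roles concrete, here is a plain-Python sketch (illustrative only, not Spark) of what row_number() over this window computes: rows are grouped by category, ordered by value inside each group, and numbered from 1, while every input row is kept.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows standing in for the DataFrame df
rows = [
    {'category': 'A', 'value': 30},
    {'category': 'A', 'value': 10},
    {'category': 'B', 'value': 20},
    {'category': 'A', 'value': 20},
]

# partitionBy('category'): group rows by category.
# orderBy('value'): sort rows inside each group.
# row_number(): number rows 1..n within each group.
numbered = []
for _, group in groupby(sorted(rows, key=itemgetter('category', 'value')),
                        key=itemgetter('category')):
    for i, row in enumerate(group, start=1):
        numbered.append({**row, 'row_num': i})

# No rows are lost; each just gains a row_num column.
```

Note that sorting happens only within each partition, so row numbering restarts at 1 for every category.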

Examples
Adds a row number to each employee within their department, ordered by salary.
Apache Spark
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

windowSpec = Window.partitionBy('department').orderBy('salary')
df.withColumn('row_num', row_number().over(windowSpec))
Calculates running total of sales per store ordered by date.
Apache Spark
from pyspark.sql.window import Window
from pyspark.sql.functions import sum as spark_sum  # avoid shadowing Python's built-in sum

windowSpec = Window.partitionBy('store').orderBy('date').rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn('running_total', spark_sum('sales').over(windowSpec))
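The frame rowsBetween(Window.unboundedPreceding, the current row) means "sum everything from the start of the partition up to and including this row." A plain-Python sketch of that cumulative logic, using made-up store data rather than Spark:

```python
# Hypothetical (store, date, sales) rows
sales = [
    ('s1', '2024-01-01', 100),
    ('s1', '2024-01-02', 50),
    ('s2', '2024-01-01', 70),
]

running = []
totals = {}  # one running total per store (the partition key)
for store, date, amount in sorted(sales):  # sorted by (store, date)
    totals[store] = totals.get(store, 0) + amount  # frame: start of partition .. current row
    running.append((store, date, amount, totals[store]))
```

Each store accumulates independently, which is exactly what partitioning by store achieves.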
Ranks items by profit within each region.
Apache Spark
from pyspark.sql.window import Window
from pyspark.sql.functions import rank

# orderBy is ascending by default; use df['profit'].desc() to rank highest profit first
windowSpec = Window.partitionBy('region').orderBy('profit')
df.withColumn('rank', rank().over(windowSpec))
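rank() leaves gaps after ties, dense_rank() does not, and row_number() breaks ties arbitrarily. A plain-Python illustration of the first two on one already-sorted partition (illustrative values, not Spark output):

```python
profits = [300, 200, 200, 100]  # one region, sorted highest-first

ranks, dense = [], []
for i, p in enumerate(profits):
    if i > 0 and p == profits[i - 1]:
        ranks.append(ranks[-1])   # tie: repeat the previous rank
        dense.append(dense[-1])
    else:
        ranks.append(i + 1)       # rank counts the positions skipped by ties
        dense.append(dense[-1] + 1 if dense else 1)  # dense_rank never skips

# The two 200s share a rank, then rank jumps to 4 while dense_rank continues at 3.
```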
Sample Program

This program creates sales data for two categories across several dates. It adds a running total of sales per category ordered by date, plus a row number for each row within its category.

Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, sum as spark_sum  # avoid shadowing built-in sum

spark = SparkSession.builder.appName('WindowFunctionsExample').getOrCreate()

data = [
    ('A', '2024-01-01', 100),
    ('A', '2024-01-02', 150),
    ('A', '2024-01-03', 200),
    ('B', '2024-01-01', 50),
    ('B', '2024-01-02', 80)
]

columns = ['category', 'date', 'sales']
df = spark.createDataFrame(data, columns)

windowSpec = Window.partitionBy('category').orderBy('date').rowsBetween(Window.unboundedPreceding, Window.currentRow)

result = df.withColumn('running_total', spark_sum('sales').over(windowSpec))
result = result.withColumn('row_num', row_number().over(Window.partitionBy('category').orderBy('date')))

result.show()
Important Notes

Window functions do not reduce the number of rows like groupBy does.

Use partitionBy to define groups, and orderBy to define order inside groups.

Common window functions include row_number(), rank(), sum(), avg(), and lag().
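lag() pulls a value from the previous row of the ordered window, which is how day-over-day changes are computed. A plain-Python sketch of that idea with made-up temperature readings (not Spark):

```python
# Hypothetical date-ordered (date, temperature) rows
temps = [('2024-01-01', 20), ('2024-01-02', 23), ('2024-01-03', 21)]

changes = []
prev = None  # lag(1): the first row has no previous value, so it gets None
for date, t in temps:
    changes.append((date, t, None if prev is None else t - prev))
    prev = t

# Each row gains the difference from the previous day, or None for the first row.
```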

Summary

Window functions let you add calculations across related rows without grouping.

They keep every row and add new columns of derived values such as running totals or ranks.

Use partitionBy and orderBy to control how data is grouped and ordered for these calculations.