Window functions perform calculations across rows that are related to the current row, without collapsing those rows the way grouping does: every input row is kept, and the result of the calculation is added as a new column.
Window functions in Apache Spark
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

windowSpec = Window.partitionBy('category').orderBy('value')

# Example: add a row number within each category
result = df.withColumn('row_num', row_number().over(windowSpec))
partitionBy splits data into groups (like categories).
orderBy defines the order inside each group for calculations.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

windowSpec = Window.partitionBy('department').orderBy('salary')
df.withColumn('row_num', row_number().over(windowSpec))
from pyspark.sql.window import Window
from pyspark.sql.functions import sum

# Frame: from the start of the partition up to the current row
windowSpec = (Window.partitionBy('store').orderBy('date')
              .rowsBetween(Window.unboundedPreceding, 0))
df.withColumn('running_total', sum('sales').over(windowSpec))
from pyspark.sql.window import Window
from pyspark.sql.functions import rank

windowSpec = Window.partitionBy('region').orderBy('profit')
df.withColumn('rank', rank().over(windowSpec))
This program creates sales data for two categories across several dates, then adds a running total of sales per category (ordered by date) and a row number for each row within its category.
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, sum

spark = SparkSession.builder.appName('WindowFunctionsExample').getOrCreate()

data = [
    ('A', '2024-01-01', 100),
    ('A', '2024-01-02', 150),
    ('A', '2024-01-03', 200),
    ('B', '2024-01-01', 50),
    ('B', '2024-01-02', 80)
]
columns = ['category', 'date', 'sales']
df = spark.createDataFrame(data, columns)

# Running total: from the start of each category up to the current row
windowSpec = (Window.partitionBy('category').orderBy('date')
              .rowsBetween(Window.unboundedPreceding, 0))
result = df.withColumn('running_total', sum('sales').over(windowSpec))
result = result.withColumn(
    'row_num',
    row_number().over(Window.partitionBy('category').orderBy('date')))
result.show()
Window functions do not reduce the number of rows like groupBy does.
Use partitionBy to define groups, and orderBy to define order inside groups.
Common window functions include row_number(), rank(), sum(), avg(), and lag().
Window functions let you add calculations across related rows without grouping.
They keep all rows and add new columns with derived values such as running totals or ranks.
Use partitionBy and orderBy to control how data is grouped and ordered for these calculations.