
How to Use Window Functions in PySpark: Syntax and Example

In PySpark, use Window from pyspark.sql.window to define a window specification, then apply window functions like row_number() or rank() over that window using withColumn(). This lets you perform calculations across rows related to the current row without grouping the data.

Syntax

To use window functions in PySpark, first import Window and the functions you need from pyspark.sql.functions. Define a window specification with partitionBy and orderBy to set how rows are grouped and ordered. Then apply a window function like row_number() over this window inside withColumn() to create a new column.

  • Window.partitionBy(): Groups rows by column(s).
  • Window.orderBy(): Orders rows within each group.
  • Window.rowsBetween(): Defines frame boundaries (optional).
  • Window functions: Functions like row_number(), rank(), lag(), lead().
python
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

windowSpec = Window.partitionBy('category').orderBy('value')
df = df.withColumn('row_num', row_number().over(windowSpec))

Example

This example shows how to add a row number to each row within groups defined by the 'category' column, ordered by the 'value' column.

python
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

spark = SparkSession.builder.appName('WindowFunctionExample').getOrCreate()
data = [
    ('A', 10),
    ('A', 20),
    ('B', 15),
    ('B', 5),
    ('A', 30),
    ('B', 25)
]
columns = ['category', 'value']
df = spark.createDataFrame(data, columns)

windowSpec = Window.partitionBy('category').orderBy('value')
df_with_rownum = df.withColumn('row_num', row_number().over(windowSpec))
df_with_rownum.show()
Output
+--------+-----+-------+
|category|value|row_num|
+--------+-----+-------+
|       A|   10|      1|
|       A|   20|      2|
|       A|   30|      3|
|       B|    5|      1|
|       B|   15|      2|
|       B|   25|      3|
+--------+-----+-------+

Common Pitfalls

Common mistakes when using window functions in PySpark include:

  • Not defining a Window specification before applying the function.
  • Forgetting to use .over(windowSpec) with the window function.
  • Using orderBy without partitionBy when grouping is needed, which can lead to unexpected results.
  • Confusing window functions with aggregation functions that reduce rows.

Always remember window functions keep the original number of rows.

python
from pyspark.sql.functions import rank

# Wrong: missing .over(windowSpec)
df.withColumn('rank', rank())  # Fails with an AnalysisException: a window function needs .over()

# Right:
df.withColumn('rank', rank().over(windowSpec))

Quick Reference

Function         | Description                                           | Example Usage
row_number()     | Assigns unique row number starting at 1 within window | row_number().over(windowSpec)
rank()           | Ranks rows with gaps for ties                         | rank().over(windowSpec)
dense_rank()     | Ranks rows without gaps for ties                      | dense_rank().over(windowSpec)
lag(col, offset) | Accesses previous row value                           | lag('value', 1).over(windowSpec)
lead(col, offset)| Accesses next row value                               | lead('value', 1).over(windowSpec)

Key Takeaways

  • Define a window specification with partitionBy and orderBy before using window functions.
  • Apply window functions with .over(windowSpec) inside withColumn to add new columns.
  • Window functions do not reduce rows; they calculate values across related rows.
  • Common window functions include row_number(), rank(), lag(), and lead().
  • Avoid forgetting the .over() call or mixing window functions with aggregations.