How to Use Window Functions in PySpark: Syntax and Example
In PySpark, use Window from pyspark.sql.window to define a window specification, then apply window functions like row_number() or rank() over that window using withColumn(). This lets you perform calculations across rows related to the current row without collapsing them into groups.
Syntax
To use window functions in PySpark, first import Window and the functions you need from pyspark.sql.functions. Define a window specification with partitionBy and orderBy to set how rows are grouped and ordered. Then apply a window function like row_number() over this window inside withColumn() to create a new column.
- Window.partitionBy(): Groups rows by column(s).
- Window.orderBy(): Orders rows within each group.
- Window.rowsBetween(): Defines frame boundaries (optional).
- Window functions: Functions like row_number(), rank(), lag(), lead().
```python
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# Define how rows are partitioned and ordered, then apply the function
windowSpec = Window.partitionBy('category').orderBy('value')
df = df.withColumn('row_num', row_number().over(windowSpec))
```
Example
This example shows how to add a row number to each row within groups defined by the 'category' column, ordered by the 'value' column.
```python
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

spark = SparkSession.builder.appName('WindowFunctionExample').getOrCreate()

data = [('A', 10), ('A', 20), ('B', 15), ('B', 5), ('A', 30), ('B', 25)]
columns = ['category', 'value']
df = spark.createDataFrame(data, columns)

windowSpec = Window.partitionBy('category').orderBy('value')
df_with_rownum = df.withColumn('row_num', row_number().over(windowSpec))
df_with_rownum.show()
```
Output
+--------+-----+-------+
|category|value|row_num|
+--------+-----+-------+
| A| 10| 1|
| A| 20| 2|
| A| 30| 3|
| B| 5| 1|
| B| 15| 2|
| B| 25| 3|
+--------+-----+-------+
Common Pitfalls
Common mistakes when using window functions in PySpark include:
- Not defining a Window specification before applying the function.
- Forgetting to use .over(windowSpec) with the window function.
- Using orderBy without partitionBy when grouping is needed, which can lead to unexpected results.
- Confusing window functions with aggregation functions that reduce rows.

Always remember window functions keep the original number of rows.
```python
from pyspark.sql.functions import rank

# Wrong: missing .over(windowSpec) -- this will error
df.withColumn('rank', rank())

# Right:
df.withColumn('rank', rank().over(windowSpec))
```
Quick Reference
| Function | Description | Example Usage |
|---|---|---|
| row_number() | Assigns unique row number starting at 1 within window | row_number().over(windowSpec) |
| rank() | Ranks rows with gaps for ties | rank().over(windowSpec) |
| dense_rank() | Ranks rows without gaps for ties | dense_rank().over(windowSpec) |
| lag(col, offset) | Accesses previous row value | lag('value', 1).over(windowSpec) |
| lead(col, offset) | Accesses next row value | lead('value', 1).over(windowSpec) |
Key Takeaways
- Define a window specification with partitionBy and orderBy before using window functions.
- Apply window functions with .over(windowSpec) inside withColumn to add new columns.
- Window functions do not reduce rows; they calculate values across related rows.
- Common window functions include row_number(), rank(), lag(), and lead().
- Avoid forgetting the .over() call or mixing window functions with aggregations.