
How to Use Window Functions in PySpark: Syntax and Example

In PySpark, use Window from pyspark.sql.window to define a window specification, then apply window functions like row_number() or rank() over that window using withColumn(). This lets you perform calculations across rows related to the current row without grouping the data.

Syntax

To use window functions in PySpark, first import Window and the functions you need from pyspark.sql.functions. Define a window specification with partitionBy and orderBy to set how rows are grouped and ordered. Then apply a window function like row_number() over this window inside withColumn() to create a new column.

  • Window.partitionBy(): Groups rows by column(s).
  • Window.orderBy(): Orders rows within each group.
  • Window.rowsBetween(): Defines frame boundaries (optional).
  • Window functions: Functions like row_number(), rank(), lag(), lead().
python
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

windowSpec = Window.partitionBy('category').orderBy('value')
df = df.withColumn('row_num', row_number().over(windowSpec))

Example

This example shows how to add a row number to each row within groups defined by the 'category' column, ordered by the 'value' column.

python
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

spark = SparkSession.builder.appName('WindowFunctionExample').getOrCreate()
data = [
    ('A', 10),
    ('A', 20),
    ('B', 15),
    ('B', 5),
    ('A', 30),
    ('B', 25)
]
columns = ['category', 'value']
df = spark.createDataFrame(data, columns)

windowSpec = Window.partitionBy('category').orderBy('value')
df_with_rownum = df.withColumn('row_num', row_number().over(windowSpec))
df_with_rownum.show()
Output
+--------+-----+-------+
|category|value|row_num|
+--------+-----+-------+
|       A|   10|      1|
|       A|   20|      2|
|       A|   30|      3|
|       B|    5|      1|
|       B|   15|      2|
|       B|   25|      3|
+--------+-----+-------+

Common Pitfalls

Common mistakes when using window functions in PySpark include:

  • Not defining a Window specification before applying the function.
  • Forgetting to use .over(windowSpec) with the window function.
  • Using orderBy without partitionBy when grouping is needed, which can lead to unexpected results.
  • Confusing window functions with aggregation functions that reduce rows.

Always remember window functions keep the original number of rows.

python
from pyspark.sql.functions import rank

# Wrong: missing .over(windowSpec)
df.withColumn('rank', rank())  # Fails with an AnalysisException: a window function needs .over()

# Right:
df.withColumn('rank', rank().over(windowSpec))

Quick Reference

Function         | Description                                           | Example Usage
row_number()     | Assigns unique row number starting at 1 within window | row_number().over(windowSpec)
rank()           | Ranks rows with gaps for ties                         | rank().over(windowSpec)
dense_rank()     | Ranks rows without gaps for ties                      | dense_rank().over(windowSpec)
lag(col, offset) | Accesses previous row value                           | lag('value', 1).over(windowSpec)
lead(col, offset)| Accesses next row value                               | lead('value', 1).over(windowSpec)

Key Takeaways

  • Define a window specification with partitionBy and orderBy before using window functions.
  • Apply window functions with .over(windowSpec) inside withColumn to add new columns.
  • Window functions do not reduce rows; they calculate values across related rows.
  • Common window functions include row_number(), rank(), lag(), and lead().
  • Avoid forgetting the .over() call or mixing window functions with aggregations.