How to Use row_number in PySpark for Row Ranking
In PySpark, use row_number() from pyspark.sql.functions with a Window specification to assign unique row numbers to rows within partitions. This helps rank or order rows based on column values inside groups.

Syntax

The row_number() function is used with a Window specification that defines how to partition and order the data. The basic syntax is:

- row_number(): Assigns a unique sequential number starting at 1 for each row in a window partition.
- Window.partitionBy(): Defines the column(s) to group rows.
- Window.orderBy(): Defines the column(s) to sort rows within each partition.
```python
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

window_spec = Window.partitionBy('group_column').orderBy('order_column')
df = df.withColumn('row_num', row_number().over(window_spec))
```
Example
This example assigns row numbers to employees within each department, ordered by salary in descending order, so the highest-paid employee in each department gets row number 1.
```python
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

spark = SparkSession.builder.appName('RowNumberExample').getOrCreate()

# Sample data
data = [
    ('Sales', 'Alice', 5000),
    ('Sales', 'Bob', 4800),
    ('HR', 'Charlie', 4500),
    ('HR', 'David', 4700),
    ('Sales', 'Eve', 5200)
]
columns = ['department', 'employee', 'salary']
df = spark.createDataFrame(data, columns)

# Define window specification: group by department, sort by salary descending
window_spec = Window.partitionBy('department').orderBy(df.salary.desc())

# Add row_number column
result_df = df.withColumn('row_number', row_number().over(window_spec))
result_df.show()
```
Output
+----------+--------+------+----------+
|department|employee|salary|row_number|
+----------+--------+------+----------+
| HR| David| 4700| 1|
| HR| Charlie| 4500| 2|
| Sales| Eve| 5200| 1|
| Sales| Alice| 5000| 2|
| Sales| Bob| 4800| 3|
+----------+--------+------+----------+
Common Pitfalls

- Not defining a Window specification causes errors because row_number() needs context to assign numbers.
- Using orderBy without partitionBy assigns row numbers globally, not per group.
- For unordered data, row numbers may not be meaningful; always specify orderBy.
```python
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# Wrong: missing window spec
# df.withColumn('row_num', row_number())  # This will error

# Right: define window with partition and order
window_spec = Window.partitionBy('group').orderBy('value')
df = df.withColumn('row_num', row_number().over(window_spec))
Quick Reference
| Function | Purpose | Example Usage |
|---|---|---|
| row_number() | Assigns unique row numbers in window | row_number().over(window_spec) |
| Window.partitionBy(cols) | Groups rows by columns | Window.partitionBy('dept') |
| Window.orderBy(cols) | Orders rows within partitions | Window.orderBy(df.salary.desc()) |
Key Takeaways
Use row_number() with a Window spec to assign unique row numbers within groups.
Always define partitionBy and orderBy in the Window to control grouping and sorting.
row_number() starts counting at 1 for each partition.
Without a Window spec, row_number() will cause errors.
Ordering is important to get meaningful row numbers.