Apache Spark · How-To · Beginner · 3 min read

How to Use row_number in PySpark for Row Ranking

In PySpark, use row_number() from pyspark.sql.functions with a Window specification to assign a unique sequential number to each row within a partition. This is useful for ranking or ordering rows by column values inside groups.

Syntax

The row_number() function is used with a Window specification that defines how to partition and order the data. The basic syntax is:

  • row_number(): Assigns a unique sequential number starting at 1 for each row in a window partition.
  • Window.partitionBy(): Defines the column(s) to group rows.
  • Window.orderBy(): Defines the column(s) to sort rows within each partition.
python
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

window_spec = Window.partitionBy('group_column').orderBy('order_column')
df = df.withColumn('row_num', row_number().over(window_spec))

Example

This example shows how to assign row numbers to employees within each department ordered by their salary descending. It demonstrates grouping by department and ordering by salary.

python
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

spark = SparkSession.builder.appName('RowNumberExample').getOrCreate()

# Sample data
data = [
    ('Sales', 'Alice', 5000),
    ('Sales', 'Bob', 4800),
    ('HR', 'Charlie', 4500),
    ('HR', 'David', 4700),
    ('Sales', 'Eve', 5200)
]

columns = ['department', 'employee', 'salary']
df = spark.createDataFrame(data, columns)

# Define window specification
window_spec = Window.partitionBy('department').orderBy(df.salary.desc())

# Add row_number column
result_df = df.withColumn('row_number', row_number().over(window_spec))

result_df.show()
Output

+----------+--------+------+----------+
|department|employee|salary|row_number|
+----------+--------+------+----------+
|        HR|   David|  4700|         1|
|        HR| Charlie|  4500|         2|
|     Sales|     Eve|  5200|         1|
|     Sales|   Alice|  5000|         2|
|     Sales|     Bob|  4800|         3|
+----------+--------+------+----------+

Common Pitfalls

  • Not defining a Window specification causes errors because row_number() needs context to assign numbers.
  • Using orderBy without partitionBy assigns row numbers globally, not per group.
  • For unordered data, row numbers may not be meaningful; always specify orderBy.
python
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# Wrong: missing window spec
# df.withColumn('row_num', row_number())  # This will error

# Right: define window with partition and order
window_spec = Window.partitionBy('group').orderBy('value')
df = df.withColumn('row_num', row_number().over(window_spec))

Quick Reference

Function                 | Purpose                              | Example Usage
row_number()             | Assigns unique row numbers in window | row_number().over(window_spec)
Window.partitionBy(cols) | Groups rows by columns               | Window.partitionBy('dept')
Window.orderBy(cols)     | Orders rows within partitions        | Window.orderBy(df.salary.desc())

Key Takeaways

  • Use row_number() with a Window spec to assign unique row numbers within groups.
  • Always define partitionBy and orderBy in the Window to control grouping and sorting.
  • row_number() starts counting at 1 for each partition.
  • Without a Window spec, row_number() will cause errors.
  • Ordering is important to get meaningful row numbers.