Apache Spark · How-To · Beginner · 3 min read

How to Use Rank Function in PySpark for Data Ranking

In PySpark, you use the rank() function from pyspark.sql.functions with a Window specification to assign ranks to rows within partitions, ordered by specific columns. Tied rows receive the same rank, and the numbering skips values after a tie (for example, two rows ranked 1 are followed by rank 3).

Syntax

The rank() function is used with a Window specification that defines how to partition and order the data. The basic syntax is:

  • rank().over(windowSpec): Applies ranking over the window.
  • Window.partitionBy(columns): Divides data into groups.
  • Window.orderBy(columns): Defines the order within each group.
python
from pyspark.sql.window import Window
from pyspark.sql.functions import rank

windowSpec = Window.partitionBy('group_column').orderBy('order_column')
ranked_df = df.withColumn('rank', rank().over(windowSpec))

Example

This example shows how to rank employees by their salary within each department. Employees with the same salary get the same rank, and ranks skip numbers after ties.

python
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import rank

spark = SparkSession.builder.appName('RankExample').getOrCreate()

# Sample data
data = [
    ('Sales', 'Alice', 5000),
    ('Sales', 'Bob', 6000),
    ('Sales', 'Charlie', 6000),
    ('HR', 'David', 4500),
    ('HR', 'Eve', 4500),
    ('HR', 'Frank', 4000)
]

columns = ['department', 'employee', 'salary']

df = spark.createDataFrame(data, columns)

# Define window specification
windowSpec = Window.partitionBy('department').orderBy(df['salary'].desc())

# Apply rank
ranked_df = df.withColumn('rank', rank().over(windowSpec))

ranked_df.show()
Output
+----------+--------+------+----+
|department|employee|salary|rank|
+----------+--------+------+----+
|     Sales|     Bob|  6000|   1|
|     Sales| Charlie|  6000|   1|
|     Sales|   Alice|  5000|   3|
|        HR|   David|  4500|   1|
|        HR|     Eve|  4500|   1|
|        HR|   Frank|  4000|   3|
+----------+--------+------+----+

Common Pitfalls

Common mistakes when using rank() in PySpark include:

  • Not defining a Window specification, which causes errors.
  • Using orderBy without partitionBy when grouping is needed.
  • Confusing rank() with dense_rank(): rank() leaves gaps after ties, dense_rank() does not.
python
from pyspark.sql.window import Window
from pyspark.sql.functions import rank

# Wrong: missing window specification
# df.withColumn('rank', rank())  # This will raise an error

# Correct: define window
windowSpec = Window.partitionBy('group').orderBy('value')
ranked_df = df.withColumn('rank', rank().over(windowSpec))

Quick Reference

Function                    Description
rank()                      Assigns rank with gaps for ties within a window
dense_rank()                Assigns rank without gaps for ties
Window.partitionBy(cols)    Groups data into partitions
Window.orderBy(cols)        Orders data within partitions
withColumn(name, expr)      Adds a new column from an expression

Key Takeaways

  • Use rank() with a Window specification to assign ranks within groups in PySpark.
  • Define partitionBy() to group rows and orderBy() to sort rows before ranking.
  • rank() assigns the same rank to ties but leaves gaps in the ranking numbers.
  • Always specify the window when using rank() to avoid errors.
  • For continuous ranks without gaps, use dense_rank() instead.