How to Use Rank Function in PySpark for Data Ranking
In PySpark, you use the rank() function from pyspark.sql.functions with a Window specification to assign ranks to rows within partitions, ordered by specific columns. Rows that tie receive the same rank, and the ranking skips numbers after each tie.

Syntax
The rank() function is used with a Window specification that defines how to partition and order the data. The basic syntax is:
- rank().over(windowSpec): Applies ranking over the window.
- Window.partitionBy(columns): Divides data into groups.
- Window.orderBy(columns): Defines the order within each group.
```python
from pyspark.sql.window import Window
from pyspark.sql.functions import rank

windowSpec = Window.partitionBy('group_column').orderBy('order_column')
ranked_df = df.withColumn('rank', rank().over(windowSpec))
```
Example
This example shows how to rank employees by their salary within each department. Employees with the same salary get the same rank, and ranks skip numbers after ties.
```python
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import rank

spark = SparkSession.builder.appName('RankExample').getOrCreate()

# Sample data
data = [
    ('Sales', 'Alice', 5000),
    ('Sales', 'Bob', 6000),
    ('Sales', 'Charlie', 6000),
    ('HR', 'David', 4500),
    ('HR', 'Eve', 4500),
    ('HR', 'Frank', 4000)
]
columns = ['department', 'employee', 'salary']
df = spark.createDataFrame(data, columns)

# Define window specification
windowSpec = Window.partitionBy('department').orderBy(df['salary'].desc())

# Apply rank
ranked_df = df.withColumn('rank', rank().over(windowSpec))
ranked_df.show()
```
Output

```
+----------+--------+------+----+
|department|employee|salary|rank|
+----------+--------+------+----+
|     Sales|     Bob|  6000|   1|
|     Sales| Charlie|  6000|   1|
|     Sales|   Alice|  5000|   3|
|        HR|   David|  4500|   1|
|        HR|     Eve|  4500|   1|
|        HR|   Frank|  4000|   3|
+----------+--------+------+----+
```
Common Pitfalls
Common mistakes when using rank() in PySpark include:
- Not defining a Window specification, which causes errors.
- Using orderBy without partitionBy when grouping is needed.
- Confusing rank() with dense_rank(): rank() leaves gaps after ties, dense_rank() does not.
```python
from pyspark.sql.window import Window
from pyspark.sql.functions import rank

# Wrong: missing window specification
# df.withColumn('rank', rank())  # This will raise an error

# Correct: define window
windowSpec = Window.partitionBy('group').orderBy('value')
ranked_df = df.withColumn('rank', rank().over(windowSpec))
```
Quick Reference
| Function | Description |
|---|---|
| rank() | Assigns ranks within a window; ties share a rank and leave gaps after them |
| dense_rank() | Assigns ranks within a window; ties share a rank with no gaps |
| Window.partitionBy(cols) | Groups data into partitions |
| Window.orderBy(cols) | Orders data within partitions |
| withColumn(name, expr) | Adds a new column with expression |
Key Takeaways
- Use rank() with a Window specification to assign ranks within groups in PySpark.
- Define partitionBy() to group rows and orderBy() to sort rows before ranking.
- rank() assigns the same rank to ties but leaves gaps in ranking numbers.
- Always specify the window when using rank() to avoid errors.
- For continuous ranks without gaps, consider using dense_rank() instead.