How to Use orderBy in PySpark for Sorting DataFrames
In PySpark, use orderBy on a DataFrame to sort data by one or more columns. Sorting is ascending by default; pass column names or Column expressions wrapped with asc() or desc() to control the direction per column.
Syntax
The orderBy method sorts a DataFrame by the specified columns.
- df.orderBy(col1, col2, ...): Sorts by the given columns, ascending by default.
- df.orderBy(df['col1'].desc()): Sorts by a column in descending order.
- You can mix ascending and descending order by applying asc() or desc() to individual columns.
```python
DataFrame.orderBy(*cols, ascending=True)
# cols: one or more column names or Column expressions
# ascending: bool or list of bools; True for ascending, False for descending
```
Example
This example shows how to create a PySpark DataFrame and sort it by one column ascending and another descending using orderBy.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

spark = SparkSession.builder.master('local').appName('OrderByExample').getOrCreate()

# Sample data
data = [
    (1, 'Alice', 50),
    (2, 'Bob', 45),
    (3, 'Cathy', 50),
    (4, 'David', 40)
]

# Create DataFrame
columns = ['id', 'name', 'score']
df = spark.createDataFrame(data, columns)

# Sort by score ascending, then name descending
sorted_df = df.orderBy('score', desc('name'))
sorted_df.show()
```
Output
+---+-----+-----+
| id| name|score|
+---+-----+-----+
| 4|David| 40|
| 2| Bob| 45|
| 3|Cathy| 50|
| 1|Alice| 50|
+---+-----+-----+
Common Pitfalls
Common mistakes when using orderBy include:
- Passing column names as plain strings when a mix of ascending and descending order is needed; strings alone always sort ascending.
- Using sort instead of orderBy (both work, since sort is an alias of orderBy, but orderBy is often preferred for clarity).
- Forgetting to import the desc or asc functions from pyspark.sql.functions when specifying order.
```python
from pyspark.sql.functions import desc

# Wrong: 'desc' is treated as a column name, not a sort direction
# df.orderBy('score', 'desc')  # this will not sort descending

# Right: use the desc() function
# df.orderBy('score', desc('name'))
```
Quick Reference
| Usage | Description |
|---|---|
| df.orderBy('col') | Sort by 'col' ascending (default) |
| df.orderBy(df['col'].desc()) | Sort by 'col' descending |
| df.orderBy('col1', desc('col2')) | Sort by 'col1' ascending, then 'col2' descending |
| df.orderBy(['col1', 'col2'], ascending=[True, False]) | Sort by multiple columns with specified order |
Key Takeaways
Use orderBy on a DataFrame to sort by one or more columns.
By default, orderBy sorts columns in ascending order.
Use desc() or asc() functions to specify descending or ascending order explicitly.
You can mix ascending and descending order for different columns in one orderBy call.
Always import desc and asc from pyspark.sql.functions when needed.