Apache Spark · How-To · Beginner · 3 min read

How to Use orderBy in PySpark for Sorting DataFrames

In PySpark, use orderBy on a DataFrame to sort data by one or more columns. You can specify ascending or descending order by passing column names or expressions with asc() or desc().
📝

Syntax

The orderBy function sorts a DataFrame by specified columns.

  • df.orderBy(col1, col2, ...): Sorts by the given columns, ascending by default.
  • df.orderBy(col('col1').desc()): Sorts by a column in descending order (col comes from pyspark.sql.functions; df['col1'].desc() works too).
  • You can mix ascending and descending by applying asc() or desc() to individual columns.
```python
DataFrame.orderBy(*cols, ascending=True)

# cols: one or more column names or Column expressions
# ascending: bool or list of bools; True for ascending, False for descending
```
💻

Example

This example shows how to create a PySpark DataFrame and sort it by one column ascending and another descending using orderBy.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

spark = SparkSession.builder.master('local').appName('OrderByExample').getOrCreate()

# Sample data
data = [
    (1, 'Alice', 50),
    (2, 'Bob', 45),
    (3, 'Cathy', 50),
    (4, 'David', 40)
]

# Create DataFrame
columns = ['id', 'name', 'score']
df = spark.createDataFrame(data, columns)

# Sort by score ascending, then name descending
sorted_df = df.orderBy('score', desc('name'))

sorted_df.show()
```

Output

```
+---+-----+-----+
| id| name|score|
+---+-----+-----+
|  4|David|   40|
|  2|  Bob|   45|
|  3|Cathy|   50|
|  1|Alice|   50|
+---+-----+-----+
```
⚠️

Common Pitfalls

Common mistakes when using orderBy include:

  • Passing column names as strings without specifying ascending/descending when mixed order is needed.
  • Worrying about sort versus orderBy: the two are interchangeable (orderBy is an alias for sort), so choosing between them is a style preference, not a correctness issue.
  • Not importing desc or asc functions when trying to specify order.
```python
from pyspark.sql.functions import desc

# Wrong: 'desc' is treated as a column name, so this raises an
# AnalysisException (there is no column named 'desc')
# df.orderBy('score', 'desc')

# Right: use the desc() function
# df.orderBy('score', desc('name'))
```
📊

Quick Reference

| Usage | Description |
| --- | --- |
| `df.orderBy('col')` | Sort by `col` ascending (default) |
| `df.orderBy(col('col').desc())` | Sort by `col` descending |
| `df.orderBy('col1', desc('col2'))` | Sort by `col1` ascending, then `col2` descending |
| `df.orderBy(['col1', 'col2'], ascending=[True, False])` | Sort by multiple columns with per-column order |
✅

Key Takeaways

  • Use orderBy on a DataFrame to sort by one or more columns.
  • By default, orderBy sorts columns in ascending order.
  • Use desc() or asc() to specify descending or ascending order explicitly.
  • You can mix ascending and descending order for different columns in one orderBy call.
  • Import desc and asc from pyspark.sql.functions when you need them.