How to Use sortByKey in PySpark: Syntax and Example

In PySpark, sortByKey() sorts an RDD of key-value pairs by key. The order is ascending by default: call rdd.sortByKey() or rdd.sortByKey(ascending=True) for ascending order, and rdd.sortByKey(ascending=False) for descending order.

📐

Syntax

The sortByKey() function sorts an RDD of key-value pairs by the key.

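rdd.sortByKey(ascending=True, numPartitions=None, keyfunc=lambda x: x)
ascending: Boolean. True (default) sorts ascending; False sorts descending.
numPartitions: Optional number of partitions for the result RDD.
keyfunc: Optional function applied to each key before comparison. The default leaves keys unchanged.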

python

sorted_rdd = rdd.sortByKey(ascending=True, numPartitions=None)

💻

Example

This example creates an RDD of key-value pairs and sorts it by key in ascending and descending order using sortByKey().

python

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('SortByKeyExample').getOrCreate()
sc = spark.sparkContext

# Create an RDD of key-value pairs
rdd = sc.parallelize([(3, 'apple'), (1, 'banana'), (2, 'cherry'), (4, 'date')])

# Sort by key ascending
sorted_asc = rdd.sortByKey(ascending=True).collect()

# Sort by key descending
sorted_desc = rdd.sortByKey(ascending=False).collect()

print('Ascending:', sorted_asc)
print('Descending:', sorted_desc)

spark.stop()

Output

Ascending: [(1, 'banana'), (2, 'cherry'), (3, 'apple'), (4, 'date')]
Descending: [(4, 'date'), (3, 'apple'), (2, 'cherry'), (1, 'banana')]

⚠️

Common Pitfalls

Common mistakes when using sortByKey() include:

Calling sortByKey() on an RDD whose elements are not key-value pairs. Each element must be a (key, value) tuple.
Forgetting that sortByKey() returns a new RDD and does not sort the original RDD in place.
Forgetting to call an action such as collect() or take() on the result. sortByKey() returns an RDD, not a list, so an action is needed to see the sorted data.

python

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('SortByKeyPitfall').getOrCreate()
sc = spark.sparkContext

# Incorrect: RDD without key-value pairs
rdd_wrong = sc.parallelize([3, 1, 2, 4])

# This will raise an error because elements are not key-value pairs
try:
    rdd_wrong.sortByKey().collect()
except Exception as e:
    print('Error:', e)

# Correct: RDD with key-value pairs
rdd_correct = sc.parallelize([(3, 'apple'), (1, 'banana')])
sorted_rdd = rdd_correct.sortByKey().collect()
print('Sorted:', sorted_rdd)

spark.stop()

Output

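Error: ... TypeError: 'int' object is not subscriptable ...
Sorted: [(1, 'banana'), (3, 'apple')]

The error line is abbreviated. Spark prints a long traceback from the Python worker, and its exact wording varies by version.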

📊

Quick Reference

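Parameter	Description	Default
ascending	Sort keys in ascending order if True, descending if False	True
numPartitions	Number of partitions for the sorted RDD	None (spark.default.parallelism if set, otherwise the partition count of the source RDD)
keyfunc	Function applied to each key before comparison	Identity (keys compared as-is)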

✅

Key Takeaways

Use sortByKey() on RDDs of key-value pairs to sort by keys easily.

By default, sortByKey() sorts keys in ascending order; set ascending=False for descending.

sortByKey() returns a new RDD; call an action like collect() to see results.

Ensure your RDD elements are tuples (key, value) before using sortByKey().

You can control the number of partitions in the sorted RDD with numPartitions.