
Partition tuning (repartition vs coalesce) in Apache Spark

Introduction

Partition tuning helps your Spark jobs run faster by changing how data is split across the executors in a cluster. repartition and coalesce are the two main ways to do this. Typical situations where partition tuning helps:

When you want to increase the number of partitions to spread data more evenly.
When you want to reduce the number of partitions to save resources.
When your data is unevenly spread and causes slow processing.
When you want to prepare data for a shuffle operation like join or aggregation.
When you want to avoid a full shuffle to save time and resources.
Syntax
Apache Spark
DataFrame.repartition(numPartitions)
DataFrame.coalesce(numPartitions)

repartition reshuffles all data across the cluster and can either increase or decrease the number of partitions.

coalesce merges existing partitions without a full shuffle; it is faster, but can only decrease the number of partitions.

Examples
This creates 10 partitions by fully reshuffling the data.
Apache Spark
df_repart = df.repartition(10)
This reduces the DataFrame to 2 partitions without a full shuffle; it is faster than repartition, but can only decrease the partition count.
Apache Spark
df_coal = df.coalesce(2)
This reshuffles the data into 5 partitions, triggering a full shuffle even if the DataFrame already had 5 partitions.
Apache Spark
df_same = df.repartition(5)
Sample Program

This code creates a simple DataFrame with 100 numbers. It shows how to check the number of partitions, then repartitions to 10 partitions with a full shuffle, and finally coalesces to 5 partitions without a full shuffle.

Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('PartitionTuning').getOrCreate()

# Create a DataFrame with 100 numbers
numbers = spark.range(0, 100)

# Check original number of partitions
original_partitions = numbers.rdd.getNumPartitions()

# Repartition to 10 partitions (full shuffle)
numbers_repart = numbers.repartition(10)
repart_partitions = numbers_repart.rdd.getNumPartitions()

# Coalesce to 5 partitions (no full shuffle)
numbers_coal = numbers_repart.coalesce(5)
coal_partitions = numbers_coal.rdd.getNumPartitions()

print(f'Original partitions: {original_partitions}')
print(f'After repartition to 10: {repart_partitions}')
print(f'After coalesce to 5: {coal_partitions}')

spark.stop()
Output
Important Notes

Use repartition when you need to increase the partition count or distribute data evenly, at the cost of a full shuffle.

Use coalesce to reduce the partition count quickly without a full shuffle; it cannot increase partitions.

Too many partitions add scheduling overhead and slow down processing; too few leave executors idle and can cause data to be unevenly processed.

Summary

Repartition reshuffles data and can increase or decrease partitions.

Coalesce reduces partitions without a full shuffle; it is faster but can only decrease the partition count.

Choose the right method based on whether you want to increase or decrease partitions and how much shuffle you can afford.