Repartition vs Coalesce in PySpark: Key Differences and Usage
repartition reshuffles data across the cluster and can increase or decrease the number of partitions, at the cost of a full shuffle; coalesce reduces the number of partitions without a full shuffle by collapsing existing partitions. Use repartition when you need an even redistribution of data and coalesce when you only need to reduce partitions efficiently.

Quick Comparison
This table summarizes the main differences between repartition and coalesce in PySpark.
| Factor | repartition | coalesce |
|---|---|---|
| Operation Type | Full shuffle of data | No full shuffle, collapses partitions |
| Partition Increase | Yes, can increase partitions | No, only reduces partitions |
| Partition Decrease | Yes, with shuffle | Yes, without shuffle (default) |
| Performance | Slower due to shuffle | Faster for reducing partitions |
| Use Case | When repartitioning or increasing partitions | When reducing partitions efficiently |
| Shuffle Behavior | Always triggers a shuffle | No shuffle (the RDD API can force one with shuffle=True) |
Key Differences
repartition always triggers a full shuffle of the data across the cluster. This means it redistributes the data evenly into the specified number of partitions, which can be more expensive but ensures balanced partitions. It can be used to both increase and decrease the number of partitions.
On the other hand, coalesce is optimized for reducing the number of partitions without a full shuffle. It merges existing partitions in place, which is faster but can leave partition sizes uneven. Note that DataFrame.coalesce never shuffles and takes no shuffle parameter; if you need to rebalance data while changing the partition count, you can drop down to the RDD API, where rdd.coalesce(n, shuffle=True) forces a shuffle.
In summary, use repartition when you want to evenly redistribute data or increase partitions, and use coalesce when you want a quick way to reduce partitions without the overhead of a shuffle.
Code Comparison
Here is how you use repartition in PySpark to change the number of partitions of a DataFrame.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RepartitionExample").getOrCreate()

data = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
df = spark.createDataFrame(data, ["id", "value"])
print(f"Original partitions: {df.rdd.getNumPartitions()}")

# Repartition to 3 partitions (full shuffle)
df_repartitioned = df.repartition(3)
print(f"Partitions after repartition: {df_repartitioned.rdd.getNumPartitions()}")

spark.stop()
```
Coalesce Equivalent
Here is how you use coalesce in PySpark to reduce the number of partitions without a full shuffle.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CoalesceExample").getOrCreate()

data = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
df = spark.createDataFrame(data, ["id", "value"]).repartition(4)
print(f"Partitions before coalesce: {df.rdd.getNumPartitions()}")

# Coalesce to 2 partitions (no shuffle)
df_coalesced = df.coalesce(2)
print(f"Partitions after coalesce: {df_coalesced.rdd.getNumPartitions()}")

spark.stop()
```
When to Use Which
Choose repartition when:
- You need to increase the number of partitions.
- You want to evenly distribute data across partitions for better parallelism.
- You want to shuffle data to balance skewed partitions.
Choose coalesce when:
- You want to reduce the number of partitions efficiently without the cost of a full shuffle.
- Your data is already well distributed and you just want fewer partitions.
- You want faster execution by avoiding shuffle overhead.
Key Takeaways
- repartition always triggers a full shuffle and can increase or decrease partitions.
- coalesce reduces partitions without a shuffle by default, making it faster for downsizing.
- Use repartition to balance data or increase partitions.
- Use coalesce to quickly reduce partitions when a shuffle is not needed.
- Forcing a shuffle with coalesce (via the RDD API's shuffle=True) is possible but less common.