
Repartition vs Coalesce in PySpark: Key Differences and Usage

In PySpark, repartition reshuffles data across the cluster and can increase or decrease partitions with a full shuffle, while coalesce reduces partitions without a full shuffle by collapsing existing partitions. Use repartition for large data reshuffling and coalesce for efficient partition reduction.
⚖️

Quick Comparison

This table summarizes the main differences between repartition and coalesce in PySpark.

| Factor | repartition | coalesce |
| --- | --- | --- |
| Operation Type | Full shuffle of data | No full shuffle; collapses partitions |
| Partition Increase | Yes, can increase partitions | No, only reduces partitions |
| Partition Decrease | Yes, with shuffle | Yes, without shuffle |
| Performance | Slower due to shuffle | Faster for reducing partitions |
| Use Case | Rebalancing data or increasing partitions | Reducing partitions efficiently |
| Shuffle Behavior | Always triggers a shuffle | Never shuffles on DataFrames; the RDD API accepts shuffle=True |
⚖️

Key Differences

repartition always triggers a full shuffle of the data across the cluster. This means it redistributes the data evenly into the specified number of partitions, which can be more expensive but ensures balanced partitions. It can be used to both increase and decrease the number of partitions.

On the other hand, coalesce is optimized for reducing the number of partitions without a full shuffle. It simply merges existing partitions, which is faster but can lead to uneven partition sizes. Note that the DataFrame coalesce method never shuffles and takes no shuffle flag; the shuffle=True option belongs to the RDD API (rdd.coalesce(n, shuffle=True)), where it forces a full shuffle to rebalance partitions, making it behave like repartition.

In summary, use repartition when you want to evenly redistribute data or increase partitions, and use coalesce when you want a quick way to reduce partitions without the overhead of a shuffle.

⚖️

Code Comparison

Here is how you use repartition in PySpark to change the number of partitions of a DataFrame.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RepartitionExample").getOrCreate()
data = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
df = spark.createDataFrame(data, ["id", "value"])

print(f"Original partitions: {df.rdd.getNumPartitions()}")

# Repartition to 3 partitions (full shuffle)
df_repartitioned = df.repartition(3)
print(f"Partitions after repartition: {df_repartitioned.rdd.getNumPartitions()}")

spark.stop()
```
Output
Original partitions: 1
Partitions after repartition: 3

(The original partition count depends on spark.default.parallelism, so it may differ on your machine.)
↔️

Coalesce Equivalent

Here is how you use coalesce in PySpark to reduce the number of partitions without a full shuffle.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CoalesceExample").getOrCreate()
data = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
df = spark.createDataFrame(data, ["id", "value"]).repartition(4)

print(f"Partitions before coalesce: {df.rdd.getNumPartitions()}")

# Coalesce to 2 partitions (no shuffle)
df_coalesced = df.coalesce(2)
print(f"Partitions after coalesce: {df_coalesced.rdd.getNumPartitions()}")

spark.stop()
```
Output
Partitions before coalesce: 4
Partitions after coalesce: 2
🎯

When to Use Which

Choose repartition when:

  • You need to increase the number of partitions.
  • You want to evenly distribute data across partitions for better parallelism.
  • You want to shuffle data to balance skewed partitions.

Choose coalesce when:

  • You want to reduce the number of partitions efficiently without the cost of a full shuffle.
  • Your data is already well distributed and you just want fewer partitions.
  • You want faster execution by avoiding shuffle overhead.

Key Takeaways

  • repartition always triggers a full shuffle and can increase or decrease partitions.
  • coalesce reduces partitions without a shuffle, making it faster for downsizing.
  • Use repartition to balance data or increase partitions.
  • Use coalesce to quickly reduce partitions when a shuffle is not needed.
  • The shuffle=True flag for coalesce exists only in the RDD API; it behaves like repartition and is rarely needed.