Repartition vs Coalesce in PySpark: Key Differences and Usage
repartition reshuffles data across the cluster and can increase or decrease the number of partitions, at the cost of a full shuffle; coalesce reduces the number of partitions without a full shuffle by collapsing existing partitions. Use repartition when you need an even redistribution of data and coalesce when you only need to reduce partitions efficiently.

Quick Comparison
This table summarizes the main differences between repartition and coalesce in PySpark.
| Factor | repartition | coalesce |
|---|---|---|
| Operation Type | Full shuffle of data | No full shuffle, collapses partitions |
| Partition Increase | Yes, can increase partitions | No, only reduces partitions |
| Partition Decrease | Yes, with shuffle | Yes, without shuffle (default) |
| Performance | Slower due to shuffle | Faster for reducing partitions |
| Use Case | When repartitioning or increasing partitions | When reducing partitions efficiently |
| Shuffle Behavior | Always triggers a shuffle | No shuffle (the RDD API can force one with shuffle=True) |
Key Differences
repartition always triggers a full shuffle of the data across the cluster. This means it redistributes the data evenly into the specified number of partitions, which can be more expensive but ensures balanced partitions. It can be used to both increase and decrease the number of partitions.
On the other hand, coalesce is optimized for reducing the number of partitions without a full shuffle. It merges existing partitions in place, which is faster but can leave partition sizes uneven. Note that DataFrame.coalesce never shuffles and takes no shuffle parameter; if you need to rebalance data while changing the partition count, you can drop down to the RDD API, where rdd.coalesce(n, shuffle=True) forces a shuffle.
In summary, use repartition when you want to evenly redistribute data or increase partitions, and use coalesce when you want a quick way to reduce partitions without the overhead of a shuffle.
Code Comparison
Here is how you use repartition in PySpark to change the number of partitions of a DataFrame.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RepartitionExample").getOrCreate()

data = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
df = spark.createDataFrame(data, ["id", "value"])
print(f"Original partitions: {df.rdd.getNumPartitions()}")

# Repartition to 3 partitions (full shuffle)
df_repartitioned = df.repartition(3)
print(f"Partitions after repartition: {df_repartitioned.rdd.getNumPartitions()}")

spark.stop()
```
Coalesce Equivalent
Here is how you use coalesce in PySpark to reduce the number of partitions without a full shuffle.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CoalesceExample").getOrCreate()

data = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
df = spark.createDataFrame(data, ["id", "value"]).repartition(4)
print(f"Partitions before coalesce: {df.rdd.getNumPartitions()}")

# Coalesce to 2 partitions (no shuffle)
df_coalesced = df.coalesce(2)
print(f"Partitions after coalesce: {df_coalesced.rdd.getNumPartitions()}")

spark.stop()
```
When to Use Which
Choose repartition when:
- You need to increase the number of partitions.
- You want to evenly distribute data across partitions for better parallelism.
- You want to shuffle data to balance skewed partitions.
Choose coalesce when:
- You want to reduce the number of partitions efficiently without the cost of a full shuffle.
- Your data is already well distributed and you just want fewer partitions.
- You want faster execution by avoiding shuffle overhead.
Key Takeaways
- repartition always triggers a full shuffle and can increase or decrease partitions.
- coalesce reduces partitions without a shuffle by default, making it faster for downsizing.
- Use repartition to balance data or increase partitions.
- Use coalesce to quickly reduce partitions when a shuffle is not needed.
- Forcing a shuffle with coalesce (via the RDD API's shuffle=True) is possible but less common.