How to Use Coalesce in Spark: Syntax and Examples
In Spark, coalesce is used to reduce the number of partitions in a DataFrame or RDD without a full shuffle, making it efficient for decreasing partitions. You call it with the desired number of partitions, e.g. df.coalesce(numPartitions).

Syntax
The coalesce function is called on a DataFrame or RDD to reduce its partitions. It takes one required argument:
- numPartitions: The target number of partitions to reduce to.
It does not perform a full shuffle, so it is faster than repartition when decreasing partitions.
```scala
df.coalesce(numPartitions)
```
Example
This example shows how to create a Spark DataFrame with 4 partitions and then reduce it to 2 partitions using coalesce. It prints the number of partitions before and after.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CoalesceExample").getOrCreate()

# Create a DataFrame with 4 partitions
# (wrap each value in a tuple so toDF can infer a schema)
rdd = spark.sparkContext.parallelize([(i,) for i in range(10)], 4)
df = rdd.toDF(["number"])

# Show initial number of partitions
initial_partitions = df.rdd.getNumPartitions()
print(f"Initial partitions: {initial_partitions}")

# Use coalesce to reduce partitions to 2
coalesced_df = df.coalesce(2)

# Show number of partitions after coalesce
final_partitions = coalesced_df.rdd.getNumPartitions()
print(f"Partitions after coalesce: {final_partitions}")

spark.stop()
```
Output
```
Initial partitions: 4
Partitions after coalesce: 2
```
Common Pitfalls
1. Using coalesce to increase partitions: coalesce is designed to reduce partitions only. If you request more partitions than the DataFrame currently has, Spark silently keeps the current partition count. Use repartition instead.
2. Data skew: Since coalesce avoids shuffle, some partitions may become very large, causing uneven data distribution.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CoalescePitfall").getOrCreate()

# Wrap values in tuples so toDF can infer a schema
rdd = spark.sparkContext.parallelize([(i,) for i in range(10)], 2)
df = rdd.toDF(["num"])

# Wrong: trying to increase partitions with coalesce (will not increase)
increased_df = df.coalesce(4)
print(f"Partitions after coalesce to increase: {increased_df.rdd.getNumPartitions()}")  # Still 2

# Correct: use repartition to increase partitions
repartitioned_df = df.repartition(4)
print(f"Partitions after repartition: {repartitioned_df.rdd.getNumPartitions()}")

spark.stop()
```
Output
```
Partitions after coalesce to increase: 2
Partitions after repartition: 4
```
Quick Reference
- coalesce(numPartitions): Reduce partitions without shuffle, faster but can cause data skew.
- repartition(numPartitions): Increase or decrease partitions with shuffle, balanced data but slower.
- Use coalesce only to reduce partitions efficiently.
Key Takeaways
- Use coalesce to reduce the number of partitions efficiently without a shuffle.
- Do not use coalesce to increase partitions; use repartition instead.
- coalesce can cause uneven data distribution because it avoids the shuffle.
- Check the number of partitions with rdd.getNumPartitions() to verify changes.
- Choose coalesce for performance when decreasing partitions and repartition for balanced data.