Apache Spark · How-To · Beginner · 3 min read

How to Use Coalesce in Spark: Syntax and Examples

In Spark, coalesce reduces the number of partitions in a DataFrame or RDD without a full shuffle, which makes it more efficient than repartition when you only need to decrease the partition count. You call it with the desired number of partitions: df.coalesce(numPartitions).
📝

Syntax

The coalesce function is called on a DataFrame or RDD to reduce its partitions. It takes one required argument:

  • numPartitions: The target number of partitions to reduce to.

It does not perform a full shuffle, so it is faster than repartition when decreasing partitions.

scala
df.coalesce(numPartitions)
💻

Example

This example shows how to create a Spark DataFrame with 4 partitions and then reduce it to 2 partitions using coalesce. It prints the number of partitions before and after.

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CoalesceExample").getOrCreate()

# Create a DataFrame with 4 partitions
rdd = spark.sparkContext.parallelize([(i,) for i in range(10)], 4)  # tuples so toDF can infer a schema
df = rdd.toDF(["number"])

# Show initial number of partitions
initial_partitions = df.rdd.getNumPartitions()
print(f"Initial partitions: {initial_partitions}")

# Use coalesce to reduce partitions to 2
coalesced_df = df.coalesce(2)

# Show number of partitions after coalesce
final_partitions = coalesced_df.rdd.getNumPartitions()
print(f"Partitions after coalesce: {final_partitions}")

spark.stop()
Output
Initial partitions: 4
Partitions after coalesce: 2
⚠️

Common Pitfalls

1. Using coalesce to increase partitions: coalesce can only reduce the partition count. If you pass a number larger than the current count, the call is a silent no-op and the count stays the same. Use repartition instead.

2. Data skew: Because coalesce avoids a shuffle, it merges existing partitions as-is, so some resulting partitions may end up much larger than others, causing uneven data distribution.

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CoalescePitfall").getOrCreate()

rdd = spark.sparkContext.parallelize([(i,) for i in range(10)], 2)  # tuples so toDF can infer a schema
df = rdd.toDF(["num"])

# Wrong: Trying to increase partitions with coalesce (will not increase)
increased_df = df.coalesce(4)
print(f"Partitions after coalesce to increase: {increased_df.rdd.getNumPartitions()}")  # Still 2

# Correct: Use repartition to increase partitions
repartitioned_df = df.repartition(4)
print(f"Partitions after repartition: {repartitioned_df.rdd.getNumPartitions()}")

spark.stop()
Output
Partitions after coalesce to increase: 2
Partitions after repartition: 4
📊

Quick Reference

  • coalesce(numPartitions): Reduce partitions without shuffle, faster but can cause data skew.
  • repartition(numPartitions): Increase or decrease partitions with shuffle, balanced data but slower.
  • Use coalesce only to reduce partitions efficiently.
✅

Key Takeaways

  • Use coalesce to reduce the number of partitions efficiently without a shuffle.
  • Do not use coalesce to increase partitions; use repartition instead.
  • coalesce can cause uneven data distribution because it avoids a shuffle.
  • Check the number of partitions with rdd.getNumPartitions() to verify changes.
  • Choose coalesce for performance when decreasing partitions and repartition for balanced data.