
Partition tuning (repartition vs coalesce) in Apache Spark

Introduction

Partition tuning helps your Spark jobs run faster by changing how data is split across the executors in a cluster. repartition and coalesce are the two main ways to do this. Typical situations where partition tuning helps:

When you want to increase the number of partitions to spread data more evenly.
When you want to reduce the number of partitions to save resources.
When your data is unevenly spread and causes slow processing.
When you want to prepare data for a shuffle operation like join or aggregation.
When you want to avoid a full shuffle to save time and resources.
Syntax
Apache Spark
DataFrame.repartition(numPartitions)
DataFrame.coalesce(numPartitions)

repartition reshuffles all data across the cluster and can either increase or decrease the number of partitions.

coalesce merges existing partitions without a full shuffle; it is faster, but can only decrease the number of partitions.

Examples
This creates 10 partitions by fully reshuffling the data.
Apache Spark
df_repart = df.repartition(10)
This reduces the DataFrame to 2 partitions without a full shuffle; it is faster than repartition, but can only decrease the partition count.
Apache Spark
df_coal = df.coalesce(2)
This reshuffles the data into 5 partitions, triggering a full shuffle even if the DataFrame already had 5 partitions.
Apache Spark
df_same = df.repartition(5)
Sample Program

This code creates a simple DataFrame with 100 numbers. It shows how to check the number of partitions, then repartitions to 10 partitions with a full shuffle, and finally coalesces to 5 partitions without a full shuffle.

Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('PartitionTuning').getOrCreate()

# Create a DataFrame with 100 numbers
numbers = spark.range(0, 100)

# Check original number of partitions
original_partitions = numbers.rdd.getNumPartitions()

# Repartition to 10 partitions (full shuffle)
numbers_repart = numbers.repartition(10)
repart_partitions = numbers_repart.rdd.getNumPartitions()

# Coalesce to 5 partitions (no full shuffle)
numbers_coal = numbers_repart.coalesce(5)
coal_partitions = numbers_coal.rdd.getNumPartitions()

print(f'Original partitions: {original_partitions}')
print(f'After repartition to 10: {repart_partitions}')
print(f'After coalesce to 5: {coal_partitions}')

spark.stop()
Output
Important Notes

Use repartition when you need to increase the partition count or distribute data evenly, at the cost of a full shuffle.

Use coalesce to reduce the partition count quickly without a full shuffle; it cannot increase partitions.

Too many partitions add scheduling overhead and slow down processing; too few leave executors idle and can cause data to be unevenly processed.

Summary

Repartition reshuffles data and can increase or decrease partitions.

Coalesce reduces partitions without a full shuffle; it is faster but can only decrease the partition count.

Choose the right method based on whether you want to increase or decrease partitions and how much shuffle you can afford.