Partition tuning (repartition vs coalesce) in Apache Spark

Partition tuning can make Spark jobs run faster by controlling how data is split across the machines in a cluster. repartition and coalesce are the two main ways to change a DataFrame's partitioning.
DataFrame.repartition(numPartitions)
DataFrame.coalesce(numPartitions)
repartition reshuffles all data across the cluster and can either increase or decrease the number of partitions.
coalesce reduces the number of partitions without a full shuffle; it is faster, but it can only decrease the partition count.
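The mechanical difference can be sketched in plain Python, with lists standing in for partitions. This is only an illustration of the idea, not Spark's actual implementation (which distributes partitions across executors):

```python
def repartition(partitions, num_partitions):
    """Full shuffle: every row is rehashed and may land in any new partition."""
    new_parts = [[] for _ in range(num_partitions)]
    for part in partitions:
        for row in part:
            new_parts[hash(row) % num_partitions].append(row)
    return new_parts

def coalesce(partitions, num_partitions):
    """No full shuffle: existing partitions are folded together locally,
    so the partition count can shrink but never grow."""
    num_partitions = min(num_partitions, len(partitions))
    new_parts = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        new_parts[i % num_partitions].extend(part)
    return new_parts

parts = [[0, 1], [2, 3], [4, 5], [6, 7]]
print(len(repartition(parts, 2)))  # 2
print(len(coalesce(parts, 2)))     # 2
print(coalesce(parts, 2))          # [[0, 1, 4, 5], [2, 3, 6, 7]]
print(len(coalesce(parts, 10)))    # 4 -- coalesce cannot increase partitions
```

Note how coalesce keeps whole input partitions together and merely merges them, which is why it avoids a shuffle, while repartition touches every row.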
df_repart = df.repartition(10)
df_coal = df.coalesce(2)
df_same = df.repartition(5)

The following example creates a simple DataFrame with 100 numbers. It shows how to check the number of partitions, then repartitions to 10 partitions with a full shuffle, and finally coalesces to 5 partitions without a full shuffle.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('PartitionTuning').getOrCreate()

# Create a DataFrame with 100 numbers
numbers = spark.range(0, 100)

# Check the original number of partitions
original_partitions = numbers.rdd.getNumPartitions()

# Repartition to 10 partitions (full shuffle)
numbers_repart = numbers.repartition(10)
repart_partitions = numbers_repart.rdd.getNumPartitions()

# Coalesce to 5 partitions (no full shuffle)
numbers_coal = numbers_repart.coalesce(5)
coal_partitions = numbers_coal.rdd.getNumPartitions()

print(f'Original partitions: {original_partitions}')
print(f'After repartition to 10: {repart_partitions}')
print(f'After coalesce to 5: {coal_partitions}')

spark.stop()
Use repartition when you want to increase or evenly distribute partitions but expect a full shuffle.
Use coalesce to reduce partitions quickly without a full shuffle, but only for decreasing partitions.
Too many partitions add scheduling and shuffle overhead that can slow down processing; too few underutilize the cluster and can leave work unevenly distributed across executors.
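One way to balance these extremes is to size partitions by data volume. The 128 MB target below is a common community heuristic, not something specified by this text or guaranteed by Spark:

```python
import math

def suggest_partitions(total_bytes, target_bytes=128 * 1024 * 1024, min_parts=1):
    """Pick a partition count so each partition holds roughly target_bytes.
    The 128 MB default mirrors a widely used Spark sizing rule of thumb
    (an assumption here, not an official recommendation)."""
    return max(min_parts, math.ceil(total_bytes / target_bytes))

print(suggest_partitions(10 * 1024**3))        # 10 GiB of data -> 80 partitions
print(suggest_partitions(200 * 1024 * 1024))   # 200 MiB -> 2 partitions
```

The resulting number could then be passed to df.repartition(...) or df.coalesce(...) as shown earlier.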
Repartition reshuffles data and can increase or decrease partitions.
Coalesce reduces partitions without a full shuffle; it is faster, but it can only decrease the partition count.
Choose the right method based on whether you want to increase or decrease partitions and how much shuffle you can afford.
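The guidance above can be condensed into a toy decision helper. The function name and signature are purely illustrative, not part of any Spark API:

```python
def choose_method(current_parts, desired_parts, need_even_distribution=False):
    """Toy rule of thumb mirroring the guidance above (illustrative only):
    growing the partition count or rebalancing data requires a full shuffle,
    so use repartition; a plain decrease can use the cheaper coalesce."""
    if desired_parts > current_parts or need_even_distribution:
        return "repartition"
    return "coalesce"

print(choose_method(4, 8))                                 # repartition
print(choose_method(8, 2))                                 # coalesce
print(choose_method(8, 2, need_even_distribution=True))    # repartition
```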