How to Use Repartition in Spark for Data Distribution
Use repartition(numPartitions) in Spark to change the number of partitions of a DataFrame or RDD. It redistributes data evenly across the cluster for better parallelism and performance.

Syntax
The repartition method is called on a DataFrame or RDD to increase or decrease the number of partitions:

```python
df.repartition(numPartitions)
```

- numPartitions: the target number of partitions.
- Example: df.repartition(5) redistributes the data into 5 partitions.
Example
This example shows how to create a Spark DataFrame, check its initial number of partitions, then use repartition to change it to 3 partitions.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RepartitionExample").getOrCreate()

# Create a DataFrame with 10 numbers
data = [(i,) for i in range(10)]
df = spark.createDataFrame(data, ["number"])

# Check initial number of partitions
initial_partitions = df.rdd.getNumPartitions()

# Repartition to 3 partitions
repartitioned_df = df.repartition(3)
new_partitions = repartitioned_df.rdd.getNumPartitions()

print(f"Initial partitions: {initial_partitions}")
print(f"Partitions after repartition: {new_partitions}")

spark.stop()
```
Common Pitfalls
1. Using repartition unnecessarily: Repartition causes a full shuffle of data, which is expensive. Use it only when you need to change partition count or balance data.
2. Confusing repartition with coalesce: coalesce merges existing partitions without a full shuffle, so it is cheaper, but it can only reduce the partition count; requesting more partitions than currently exist is a no-op.
3. Not checking partition count: Always check the number of partitions before and after repartition to ensure it worked as expected.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PitfallExample").getOrCreate()

data = [(i,) for i in range(10)]
df = spark.createDataFrame(data, ["number"])

# Wrong: using repartition to shrink partitions forces a full shuffle
df_repartitioned = df.repartition(1)

# Right: coalesce reduces partitions without a shuffle
# df_coalesced = df.coalesce(1)  # more efficient when reducing partitions

spark.stop()
```
Quick Reference
| Method | Description | Shuffle | Use Case |
|---|---|---|---|
| repartition(numPartitions) | Changes number of partitions with shuffle | Yes | Increase or evenly redistribute partitions |
| coalesce(numPartitions) | Reduces partitions without shuffle | No | Decrease partitions efficiently |