Partition tuning (repartition vs coalesce) in Apache Spark - Performance Comparison
When working with big data in Spark, how fast your data moves and changes shape matters a lot.
We want to understand how repartitioning data affects the time it takes as data size grows.
Analyze the time complexity of the following code snippet.
// Starting with a DataFrame df
val repartitionedDF = df.repartition(100) // reshuffles data into 100 partitions
val coalescedDF = repartitionedDF.coalesce(10) // reduces partitions to 10 without full shuffle
repartitionedDF.count() // triggers action
coalescedDF.count() // triggers action
This code changes the number of partitions in a DataFrame using repartition and coalesce, then counts rows to trigger execution.
Identify the loops, recursion, array traversals that repeat.
- Primary operation: Data shuffle during repartition, which moves all data across the cluster.
- How many times: The shuffle touches every data record once, redistributing them into new partitions.
- Coalesce operation: Moves fewer data records because it avoids full shuffle, only merging partitions.
As the data size grows, the shuffle during repartition moves more records, so time grows roughly in direct proportion to data size.
| Input Size (n) | Approx. Operations |
|---|---|
| 10,000 rows | ~10,000 data moves during shuffle |
| 100,000 rows | ~100,000 data moves during shuffle |
| 1,000,000 rows | ~1,000,000 data moves during shuffle |
Pattern observation: The time to repartition grows roughly linearly with the number of rows because each row is moved once.
Time Complexity: O(n)
This means the time to repartition grows linearly with the amount of data you have.
[X] Wrong: "Coalesce always takes the same time as repartition because both change partitions."
[OK] Correct: Coalesce avoids a full shuffle and moves less data, so it usually runs faster and scales better with data size.
Understanding how repartition and coalesce scale helps you explain data movement costs clearly, a key skill in big data roles.
"What if we increased the number of partitions in coalesce instead of decreasing? How would that affect time complexity?"