
Partition tuning (repartition vs coalesce) in Apache Spark - Visual Side-by-Side Comparison

Concept Flow - Partition tuning (repartition vs coalesce)
Start with DataFrame
Choose partition tuning method
  repartition -> Shuffle data -> New, balanced partitions
  coalesce -> Merge existing partitions (no full shuffle) -> Fewer, possibly uneven partitions
Use tuned DataFrame for processing
Shows the choice between repartition and coalesce, their shuffle behavior, and resulting partitions.
Execution Sample
Apache Spark
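from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()  # reuse or create a session
df = spark.range(20)            # 20 rows, ids 0..19
df2 = df.repartition(5)         # full shuffle into 5 partitions
df3 = df2.coalesce(2)           # merge down to 2 partitions
df3.rdd.getNumPartitions()      # returns 2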
Create a DataFrame, repartition to 5 partitions, then coalesce to 2 partitions, and check partition count.
Execution Table
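| Step | Action | Partitions Before | Partitions After | Shuffle Occurs | Result |
|------|--------|-------------------|------------------|----------------|--------|
| 1 | Create DataFrame with default partitions | N/A | Default (spark.sparkContext.defaultParallelism, typically the number of available cores) | No | DataFrame with default partitions |
| 2 | Repartition to 5 partitions | Default | 5 | Yes | Data shuffled evenly into 5 partitions |
| 3 | Coalesce to 2 partitions | 5 | 2 | No (no full shuffle) | Partitions merged without a full shuffle |
| 4 | Check number of partitions | 2 | 2 | No | Returns 2 |
| 5 | End | 2 | 2 | No | Partition tuning complete |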
💡 Partition tuning ends once coalesce has reduced the DataFrame to 2 partitions without a full shuffle.
Variable Tracker
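| Variable | Start | After repartition(5) | After coalesce(2) | Final |
|----------|-------|----------------------|-------------------|-------|
| df | DataFrame with default partitions | Same | Same | Same |
| df2 | N/A | DataFrame with 5 partitions | Same | Same |
| df3 | N/A | N/A | DataFrame with 2 partitions | Same |
| num_partitions | N/A | N/A | N/A | 2 |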
Key Moments - 3 Insights
Why does repartition cause a shuffle but coalesce usually does not?
repartition redistributes every row across the new partitions (round-robin when no columns are given), which requires a full shuffle (see step 2 in execution_table). coalesce instead combines whole existing partitions into fewer ones, so it avoids that expensive data movement (step 3).
Can coalesce increase the number of partitions?
No, coalesce only reduces partitions or keeps them the same. To increase partitions, repartition must be used (see step 2 vs step 3).
Why might coalesce result in uneven partition sizes?
Because coalesce merges partitions without reshuffling data, some partitions may become larger than others, causing imbalance (explained in concept_flow).
Visual Quiz - 3 Questions
Test your understanding
Look at step 2 of the execution_table. How many partitions does the DataFrame have after repartition?
A. 2
B. 5
C. 200
D. Default number
💡 Hint
Check the 'Partitions After' column at step 2 in the execution_table.
At which step does the shuffle occur according to the execution_table?
A. Step 2
B. Step 1
C. Step 3
D. Step 4
💡 Hint
Look at the 'Shuffle Occurs' column in the execution_table.
If you want to increase partitions from 2 to 5, which method should you use?
A. coalesce
B. Neither, partitions cannot increase
C. repartition
D. Both coalesce and repartition
💡 Hint
Refer to key_moments about which method increases partitions.
Concept Snapshot
Partition tuning in Spark:
- repartition(n): reshuffles data, creates n balanced partitions
- coalesce(n): merges partitions, reduces to n without full shuffle
- repartition can increase or decrease partitions
- coalesce only reduces partitions
- repartition is costlier but balances data
- coalesce is cheaper but may cause uneven partitions
Full Transcript
This visual execution shows how Spark partition tuning works using repartition and coalesce. Starting with a DataFrame, repartition reshuffles data to create a specified number of balanced partitions, causing a shuffle. Coalesce merges existing partitions to reduce their number without a full shuffle, which is cheaper but can cause uneven partition sizes. The example code creates a DataFrame, repartitions it to 5 partitions, then coalesces to 2 partitions. The execution table traces each step, showing partition counts and shuffle occurrence. Key moments clarify common confusions like why repartition shuffles and coalesce does not, and that coalesce cannot increase partitions. The visual quiz tests understanding of partition counts and method effects. The snapshot summarizes key points for quick recall.