Partition tuning (repartition vs coalesce) in Apache Spark - Visual Side-by-Side Comparison

Concept Flow - Partition tuning (repartition vs coalesce)

Start with DataFrame
↓
Choose partition tuning method
↓
repartition: Shuffle data → New partitions → Balanced partitions
coalesce: Merge existing partitions (no shuffle) → Fewer, possibly uneven partitions
↓
Use tuned DataFrame for processing

Shows the choice between repartition and coalesce, their shuffle behavior, and resulting partitions.

Execution Sample

Apache Spark

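from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-tuning").getOrCreate()
df = spark.range(20)               # starts with the default parallelism
df2 = df.repartition(5)            # full shuffle into 5 partitions
df3 = df2.coalesce(2)              # merge down to 2 partitions, no extra shuffle
print(df3.rdd.getNumPartitions())  # 2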

Create a DataFrame, repartition to 5 partitions, then coalesce to 2 partitions, and check partition count.

Execution Table

Step	Action	Partitions Before	Partitions After	Shuffle Occurs	Result
1	Create DataFrame with spark.range(20)	N/A	Default parallelism (typically the number of available cores)	No	DataFrame with default partitions
2	Repartition to 5 partitions	Default parallelism	5	Yes	Data shuffled round-robin into 5 roughly equal partitions
3	Coalesce to 2 partitions	5	2	No	Existing partitions merged without a shuffle
4	Check number of partitions	2	2	No	Returns 2 partitions
5	End	2	2	No	Partition tuning complete

💡 Partition tuning ends once coalesce has reduced the DataFrame to 2 partitions without a shuffle.

Variable Tracker

Variable	Start	After repartition(5)	After coalesce(2)	Final
df	DataFrame with default partitions	Same	Same	Same
df2	N/A	DataFrame with 5 partitions	Same	Same
df3	N/A	N/A	DataFrame with 2 partitions	Same
num_partitions	N/A	N/A	N/A	2

Key Moments - 3 Insights

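Why does repartition cause a shuffle but coalesce usually does not?

repartition(n) redistributes every row across the cluster to build n new partitions, which requires a full shuffle. coalesce(n) merges existing partitions into fewer ones, so rows are combined where they already sit and no shuffle is needed.

Can coalesce increase the number of partitions?

No. On a DataFrame, coalesce can only reduce the partition count; asking for more partitions than currently exist leaves the count unchanged. Use repartition to increase it.

Why might coalesce result in uneven partition sizes?

coalesce merges whole partitions as they are, without redistributing individual rows, so each output partition inherits the combined size of its inputs. Merging 5 equal partitions into 2, for example, cannot split evenly: one output holds more input partitions than the other.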
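The uneven-sizes question can be made concrete with a toy model in plain Python. This is only a sketch under simplifying assumptions, not Spark's implementation: `toy_repartition` and `toy_coalesce` are hypothetical helpers, the round-robin dealing stands in for a full shuffle, and the contiguous grouping ignores data locality, which real Spark takes into account. The starting layout of 2 partitions is also an assumption.

```python
from typing import List


def toy_repartition(partitions: List[List[int]], n: int) -> List[List[int]]:
    """Toy full shuffle: deal every row round-robin into n new partitions."""
    rows = [row for part in partitions for row in part]
    out: List[List[int]] = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


def toy_coalesce(partitions: List[List[int]], n: int) -> List[List[int]]:
    """Toy coalesce: merge contiguous groups of whole partitions, moving no single rows."""
    total = len(partitions)
    n = min(n, total)  # a coalesce never increases the partition count
    out: List[List[int]] = []
    for i in range(n):
        start = i * total // n
        end = (i + 1) * total // n
        out.append([row for part in partitions[start:end] for row in part])
    return out


rows = list(range(20))
initial = [rows[:10], rows[10:]]    # assumed starting layout: 2 partitions
five = toy_repartition(initial, 5)  # balanced: every partition gets 4 rows
two = toy_coalesce(five, 2)         # merged: 2 inputs in one, 3 in the other

sizes_five = [len(p) for p in five]
sizes_two = [len(p) for p in two]
print(sizes_five)  # [4, 4, 4, 4, 4]
print(sizes_two)   # [8, 12]
```

In this model the repartition step touches every row, while the coalesce step only concatenates whole partitions, which is why it is cheaper and why its output sizes depend on how the inputs happen to group.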

Visual Quiz - 3 Questions

Test your understanding

Look at step 2 of the Execution Table. How many partitions does the DataFrame have after repartition?

C. 200

D. Default number

Concept Snapshot

Partition tuning in Spark:
- repartition(n): reshuffles data, creates n balanced partitions
- coalesce(n): merges partitions, reduces to n without full shuffle
- repartition can increase or decrease partitions
- coalesce only reduces partitions
- repartition is costlier but balances data
- coalesce is cheaper but may cause uneven partitions
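The rules in this snapshot can be condensed into a small decision helper. This is a sketch only: `choose_method` and its `need_balanced` flag are hypothetical names introduced for illustration, not part of the Spark API, and real tuning decisions would also weigh data size and downstream stages.

```python
def choose_method(current: int, target: int, need_balanced: bool = False) -> str:
    """Suggest a partition tuning method based on the snapshot's rules.

    current       -- current number of partitions
    target        -- desired number of partitions
    need_balanced -- True if evenly sized partitions matter more than cost
    """
    if current < 1 or target < 1:
        raise ValueError("partition counts must be positive")
    # Increasing the count, or rebalancing, requires a shuffle.
    if target > current or need_balanced:
        return "repartition"
    # Reducing the count without a balance requirement: merge cheaply.
    if target < current:
        return "coalesce"
    # Same count and no balance requirement: nothing to do.
    return "none"


print(choose_method(5, 2))                      # coalesce
print(choose_method(2, 5))                      # repartition
print(choose_method(5, 2, need_balanced=True))  # repartition
```

The helper returns a method name rather than calling Spark, so the caller decides whether to apply `df.repartition(target)` or `df.coalesce(target)`.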

Full Transcript

This visual execution shows how Spark partition tuning works using repartition and coalesce. Starting with a DataFrame, repartition reshuffles data to create a specified number of balanced partitions, causing a shuffle. Coalesce merges existing partitions to reduce their number without a full shuffle, which is cheaper but can cause uneven partition sizes. The example code creates a DataFrame, repartitions it to 5 partitions, then coalesces to 2 partitions. The execution table traces each step, showing partition counts and shuffle occurrence. Key moments clarify common confusions like why repartition shuffles and coalesce does not, and that coalesce cannot increase partitions. The visual quiz tests understanding of partition counts and method effects. The snapshot summarizes key points for quick recall.