
Handling skewed joins in Apache Spark - Step-by-Step Execution

Concept Flow - Handling skewed joins
Start Join Operation
  -> Detect Skewed Keys?
     No  -> Perform Normal Join -> Output Result
     Yes -> Apply Skew Join Handling
         -> Split Skewed Keys
         -> Join Skewed Keys Separately
         -> Combine Results
         -> Output Result
The join operation first checks for skewed keys. If none are found, it performs a normal join. If skewed keys exist, it splits them out, joins each part separately, and then combines the results.
Execution Sample
Apache Spark
from pyspark.sql.functions import broadcast

# df1 (large, skewed) and df2 (small enough to broadcast) are assumed to be
# existing DataFrames that share a join column named 'key'.
skewed_keys = ['key1']  # keys previously identified as skewed

# Split df1 into skewed and non-skewed parts
skewed_df = df1.filter(df1.key.isin(skewed_keys))
rest_df = df1.filter(~df1.key.isin(skewed_keys))

# Join the skewed part against a broadcast copy of df2 (no shuffle of the skewed rows)
join_skewed = skewed_df.join(broadcast(df2), 'key')

# Join the rest with a regular shuffle join
join_rest = rest_df.join(df2, 'key')

# Combine both results (union is positional, so both schemas must match)
final_join = join_skewed.union(join_rest)
This code splits df1 on the skewed keys, joins the skewed part with a broadcast join, joins the rest with a normal shuffle join, and then unions the two results. A broadcast join is only appropriate when df2 is small enough to fit in each executor's memory.
Execution Table
Step | Action | DataFrame State | Join Type | Output Rows
1 | Identify skewed keys | skewed_keys = ['key1'] | N/A | N/A
2 | Filter skewed keys from df1 | skewed_df contains rows with key1 | N/A | N/A
3 | Filter remaining keys from df1 | rest_df contains rows without key1 | N/A | N/A
4 | Join skewed_df with df2 using broadcast | Joining skewed_df and df2 on key1 | Broadcast Join | 1000 rows (skewed key)
5 | Join rest_df with df2 normally | Joining rest_df and df2 on other keys | Shuffle Join | 9000 rows (non-skewed keys)
6 | Combine join_skewed and join_rest | Union of both join results | N/A | 10000 rows total
7 | End of join operation | Final joined DataFrame ready | N/A | 10000 rows total
💡 All rows joined; skewed keys handled separately to avoid performance issues.
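Step 1 of the table assumes the skewed keys are already known. In practice they can be found by counting rows per key and flagging any key whose count exceeds a threshold. The sketch below illustrates that logic in plain Python, with `Counter` standing in for a Spark `groupBy('key').count()`; the helper name, threshold, and data are purely hypothetical.

```python
from collections import Counter

def detect_skewed_keys(keys, threshold):
    """Return keys whose row count exceeds the threshold (hypothetical helper)."""
    counts = Counter(keys)
    return [k for k, n in counts.items() if n > threshold]

# Hypothetical data: 'key1' dominates the distribution
rows = ['key1'] * 1000 + ['key2'] * 40 + ['key3'] * 60
print(detect_skewed_keys(rows, threshold=100))  # -> ['key1']
```

In Spark itself, the same idea is a `df1.groupBy('key').count()` followed by a filter on the count column, with the threshold chosen relative to the average rows per key.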
Variable Tracker
Variable | Start | After Step 2 | After Step 3 | After Step 4 | After Step 5 | Final
skewed_keys | [] | ['key1'] | ['key1'] | ['key1'] | ['key1'] | ['key1']
skewed_df | empty | rows with key1 | rows with key1 | rows with key1 | rows with key1 | rows with key1
rest_df | empty | empty | rows without key1 | rows without key1 | rows without key1 | rows without key1
join_skewed | empty | empty | empty | joined rows with key1 | joined rows with key1 | joined rows with key1
join_rest | empty | empty | empty | empty | joined rows without key1 | joined rows without key1
final_join | empty | empty | empty | empty | empty | union of skewed and rest joins
Key Moments - 3 Insights
Why do we split the data into skewed and non-skewed parts before joining?
Splitting allows us to handle skewed keys separately with broadcast join, avoiding heavy shuffles and improving performance, as shown in steps 2-5 of the execution table.
What happens if we do not handle skewed keys separately?
The join will cause data skew, leading to some tasks taking much longer and slowing down the whole job, because the large skewed key causes uneven data distribution during shuffle.
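This uneven distribution can be made concrete with a small plain-Python simulation of hash partitioning (the partition count and data below are hypothetical): one hot key sends all of its rows to a single shuffle partition, so the task reading that partition dominates the job's runtime.

```python
from collections import Counter

def partition_sizes(keys, num_partitions):
    """Count rows landing in each shuffle partition under hash partitioning."""
    return Counter(hash(k) % num_partitions for k in keys)

# Hypothetical skew: 'key1' has 1000 rows, ten other keys have 10 rows each
rows = ['key1'] * 1000 + [f'key{i}' for i in range(2, 12) for _ in range(10)]
sizes = partition_sizes(rows, num_partitions=4)
# One partition holds at least the 1000 'key1' rows; the others share ~100 rows
print(max(sizes.values()))
```

All rows of a given key hash to the same partition, so the largest partition here is at least as big as the hot key itself, no matter how many partitions are used.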
Why do we use broadcast join for skewed keys?
Broadcast join sends the smaller DataFrame to all nodes, avoiding shuffle for skewed keys, which reduces data movement and speeds up the join, as seen in step 4.
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, what type of join is used for the skewed keys?
A. Broadcast Join
B. Shuffle Join
C. Sort Merge Join
D. Cartesian Join
💡 Hint
Check step 4 in the execution table where skewed keys are joined.
At which step do we combine the results of skewed and non-skewed joins?
A. Step 3
B. Step 6
C. Step 5
D. Step 7
💡 Hint
Look for the step mentioning union of join results.
If we did not filter out skewed keys before joining, what would likely happen?
A. The join would be faster
B. The join would produce fewer rows
C. The join would cause data skew and slow down
D. The join would fail with an error
💡 Hint
Refer to the key moment about consequences of not handling skewed keys.
Concept Snapshot
Handling skewed joins in Spark:
- Detect skewed keys causing uneven data distribution
- Split data into skewed and non-skewed parts
- Join skewed keys using broadcast join
- Join rest normally
- Combine results to get final joined DataFrame
This avoids the slow, uneven shuffle stages caused by skew.
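The snapshot's steps can be sketched end to end in plain Python, simulating the split/join/union logic on lists of dicts rather than running Spark itself; all data and names below are hypothetical.

```python
# Simulated tables: df1 is large and skewed on 'key1'; df2 is a small lookup
df1 = [{'key': 'key1', 'v': i} for i in range(5)] + [{'key': 'key2', 'v': 99}]
df2 = [{'key': 'key1', 'name': 'a'}, {'key': 'key2', 'name': 'b'}]
skewed_keys = {'key1'}

# Split df1 into skewed and non-skewed parts
skewed_part = [r for r in df1 if r['key'] in skewed_keys]
rest_part = [r for r in df1 if r['key'] not in skewed_keys]

# "Broadcast" df2 as an in-memory lookup for the skewed part
lookup = {r['key']: r for r in df2}
join_skewed = [{**l, **lookup[l['key']]} for l in skewed_part if l['key'] in lookup]

# Join the rest "normally" (a nested-loop join here, for simplicity)
join_rest = [{**l, **r} for l in rest_part for r in df2 if l['key'] == r['key']]

# Combine results
final_join = join_skewed + join_rest
print(len(final_join))  # 6 rows: 5 for key1 plus 1 for key2
```

The dict lookup plays the role of the broadcast join: every "node" has the whole small table in memory, so the hot key never has to be shuffled.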
Full Transcript
In Spark, skewed joins happen when some keys have many more rows than others, causing slow joins. To handle this, we first detect skewed keys. Then we split the data into two parts: one with skewed keys and one without. We join the skewed keys separately using a broadcast join, which avoids heavy shuffles. The rest of the data is joined normally. Finally, we combine both join results. This method improves performance by balancing the workload across the cluster.