
Handling skewed joins in Apache Spark - Step-by-Step Execution

Concept Flow - Handling skewed joins
Start Join Operation
  -> Detect Skewed Keys?
     No  -> Perform Normal Join -> Output Result
     Yes -> Apply Skew Join Handling
         -> Split Skewed Keys
         -> Join Skewed Keys Separately
         -> Combine Results
         -> Output Result
The join operation first checks for skewed keys. If none are found, it performs a normal join. If skewed keys exist, it splits them out, joins each part separately, and then combines the results.
Execution Sample
Apache Spark
from pyspark.sql.functions import broadcast

# df1 (large, skewed) and df2 (small enough to broadcast) are assumed to be
# existing DataFrames that share a join column named 'key'.
skewed_keys = ['key1']  # keys previously identified as skewed

# Split df1 into skewed and non-skewed parts
skewed_df = df1.filter(df1.key.isin(skewed_keys))
rest_df = df1.filter(~df1.key.isin(skewed_keys))

# Join the skewed part against a broadcast copy of df2 (no shuffle of the skewed rows)
join_skewed = skewed_df.join(broadcast(df2), 'key')

# Join the rest with a regular shuffle join
join_rest = rest_df.join(df2, 'key')

# Combine both results (union is positional, so both schemas must match)
final_join = join_skewed.union(join_rest)
This code splits df1 on the skewed keys, joins the skewed part with a broadcast join, joins the rest with a normal shuffle join, and then unions the two results. A broadcast join is only appropriate when df2 is small enough to fit in each executor's memory.
Execution Table
Step | Action | DataFrame State | Join Type | Output Rows
1 | Identify skewed keys | skewed_keys = ['key1'] | N/A | N/A
2 | Filter skewed keys from df1 | skewed_df contains rows with key1 | N/A | N/A
3 | Filter remaining keys from df1 | rest_df contains rows without key1 | N/A | N/A
4 | Join skewed_df with df2 using broadcast | Joining skewed_df and df2 on key1 | Broadcast Join | 1000 rows (skewed key)
5 | Join rest_df with df2 normally | Joining rest_df and df2 on other keys | Shuffle Join | 9000 rows (non-skewed keys)
6 | Combine join_skewed and join_rest | Union of both join results | N/A | 10000 rows total
7 | End of join operation | Final joined DataFrame ready | N/A | 10000 rows total
💡 All rows joined; skewed keys handled separately to avoid performance issues.
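Step 1 of the table assumes the skewed keys are already known. In practice they can be found by counting rows per key and flagging any key whose count exceeds a threshold. The sketch below illustrates that logic in plain Python, with `Counter` standing in for a Spark `groupBy('key').count()`; the helper name, threshold, and data are purely hypothetical.

```python
from collections import Counter

def detect_skewed_keys(keys, threshold):
    """Return keys whose row count exceeds the threshold (hypothetical helper)."""
    counts = Counter(keys)
    return [k for k, n in counts.items() if n > threshold]

# Hypothetical data: 'key1' dominates the distribution
rows = ['key1'] * 1000 + ['key2'] * 40 + ['key3'] * 60
print(detect_skewed_keys(rows, threshold=100))  # -> ['key1']
```

In Spark itself, the same idea is a `df1.groupBy('key').count()` followed by a filter on the count column, with the threshold chosen relative to the average rows per key.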
Variable Tracker
Variable | Start | After Step 2 | After Step 3 | After Step 4 | After Step 5 | Final
skewed_keys | [] | ['key1'] | ['key1'] | ['key1'] | ['key1'] | ['key1']
skewed_df | empty | rows with key1 | rows with key1 | rows with key1 | rows with key1 | rows with key1
rest_df | empty | empty | rows without key1 | rows without key1 | rows without key1 | rows without key1
join_skewed | empty | empty | empty | joined rows with key1 | joined rows with key1 | joined rows with key1
join_rest | empty | empty | empty | empty | joined rows without key1 | joined rows without key1
final_join | empty | empty | empty | empty | empty | union of skewed and rest joins
Key Moments - 3 Insights
Why do we split the data into skewed and non-skewed parts before joining?
Splitting allows us to handle skewed keys separately with broadcast join, avoiding heavy shuffles and improving performance, as shown in steps 2-5 of the execution table.
What happens if we do not handle skewed keys separately?
The join will cause data skew, leading to some tasks taking much longer and slowing down the whole job, because the large skewed key causes uneven data distribution during shuffle.
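This uneven distribution can be made concrete with a small plain-Python simulation of hash partitioning (the partition count and data below are hypothetical): one hot key sends all of its rows to a single shuffle partition, so the task reading that partition dominates the job's runtime.

```python
from collections import Counter

def partition_sizes(keys, num_partitions):
    """Count rows landing in each shuffle partition under hash partitioning."""
    return Counter(hash(k) % num_partitions for k in keys)

# Hypothetical skew: 'key1' has 1000 rows, ten other keys have 10 rows each
rows = ['key1'] * 1000 + [f'key{i}' for i in range(2, 12) for _ in range(10)]
sizes = partition_sizes(rows, num_partitions=4)
# One partition holds at least the 1000 'key1' rows; the others share ~100 rows
print(max(sizes.values()))
```

All rows of a given key hash to the same partition, so the largest partition here is at least as big as the hot key itself, no matter how many partitions are used.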
Why do we use broadcast join for skewed keys?
Broadcast join sends the smaller DataFrame to all nodes, avoiding shuffle for skewed keys, which reduces data movement and speeds up the join, as seen in step 4.
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, what type of join is used for the skewed keys?
A. Broadcast Join
B. Shuffle Join
C. Sort Merge Join
D. Cartesian Join
💡 Hint
Check step 4 in the execution table where skewed keys are joined.
At which step do we combine the results of skewed and non-skewed joins?
A. Step 3
B. Step 6
C. Step 5
D. Step 7
💡 Hint
Look for the step mentioning union of join results.
If we did not filter out skewed keys before joining, what would likely happen?
A. The join would be faster
B. The join would produce fewer rows
C. The join would cause data skew and slow down
D. The join would fail with an error
💡 Hint
Refer to the key moment about consequences of not handling skewed keys.
Concept Snapshot
Handling skewed joins in Spark:
- Detect skewed keys causing uneven data distribution
- Split data into skewed and non-skewed parts
- Join skewed keys using broadcast join
- Join rest normally
- Combine results to get final joined DataFrame
This avoids the slow, uneven shuffle stages caused by skew.
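The snapshot's steps can be sketched end to end in plain Python, simulating the split/join/union logic on lists of dicts rather than running Spark itself; all data and names below are hypothetical.

```python
# Simulated tables: df1 is large and skewed on 'key1'; df2 is a small lookup
df1 = [{'key': 'key1', 'v': i} for i in range(5)] + [{'key': 'key2', 'v': 99}]
df2 = [{'key': 'key1', 'name': 'a'}, {'key': 'key2', 'name': 'b'}]
skewed_keys = {'key1'}

# Split df1 into skewed and non-skewed parts
skewed_part = [r for r in df1 if r['key'] in skewed_keys]
rest_part = [r for r in df1 if r['key'] not in skewed_keys]

# "Broadcast" df2 as an in-memory lookup for the skewed part
lookup = {r['key']: r for r in df2}
join_skewed = [{**l, **lookup[l['key']]} for l in skewed_part if l['key'] in lookup]

# Join the rest "normally" (a nested-loop join here, for simplicity)
join_rest = [{**l, **r} for l in rest_part for r in df2 if l['key'] == r['key']]

# Combine results
final_join = join_skewed + join_rest
print(len(final_join))  # 6 rows: 5 for key1 plus 1 for key2
```

The dict lookup plays the role of the broadcast join: every "node" has the whole small table in memory, so the hot key never has to be shuffled.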
Full Transcript
In Spark, skewed joins happen when some keys have many more rows than others, causing slow joins. To handle this, we first detect skewed keys. Then we split the data into two parts: one with skewed keys and one without. We join the skewed keys separately using a broadcast join, which avoids heavy shuffles. The rest of the data is joined normally. Finally, we combine both join results. This method improves performance by balancing the workload across the cluster.