Handling skewed joins in Apache Spark - Time & Space Complexity
When joining two big datasets in Spark, some keys might appear way more than others. This is called skew.
We want to understand how this skew affects the time it takes to join data.
Analyze the time complexity of the following Spark join with skew handling.
val skewedKeys = Seq("key1", "key2")
val largeDF = spark.read.parquet("large_data")
val smallDF = spark.read.parquet("small_data")
val saltedLarge = largeDF.withColumn("salt", when(col("key").isin(skewedKeys: _*), (rand() * 10).cast("int")).otherwise(lit(0)))
val saltedSmall = smallDF.withColumn("salt", when(col("key").isin(skewedKeys: _*), explode(array((0 to 9).map(lit): _*))).otherwise(lit(0)))
val joined = saltedLarge.join(saltedSmall, Seq("key", "salt"))
This code splits skewed keys into smaller parts to balance the join work.
Look at what repeats when joining skewed keys.
- Primary operation: Joining data on keys, with extra splitting (salting) for skewed keys.
- How many times: For skewed keys, join happens multiple times (here 10 times) due to salt values.
Without salting, skewed keys cause one big join task that grows very large.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 skewed rows | 1 big join task |
| 100 skewed rows | 1 very big join task |
| 1000 skewed rows | 1 huge join task causing delay |
With salting, the big join splits into smaller parts, each smaller and faster.
Time Complexity: O(n)
This means the join time grows roughly in direct proportion to the total data size, but salting helps avoid slowdowns from skew.
[X] Wrong: "Adding salting makes the join slower because it repeats work multiple times."
[OK] Correct: Salting splits big skewed tasks into smaller ones that run in parallel, so overall join finishes faster despite some repeated work.
Understanding how skew affects joins and how salting helps shows you can handle real data challenges and keep Spark jobs running smoothly.
"What if we increased the number of salt splits from 10 to 100? How would the time complexity change?"