0
0
Apache Sparkdata~5 mins

Handling skewed joins in Apache Spark - Time & Space Complexity

Choose your learning style9 modes available
Time Complexity: Handling skewed joins
O(n)
Understanding Time Complexity

When joining two big datasets in Spark, some keys might appear way more than others. This is called skew.

We want to understand how this skew affects the time it takes to join data.

Scenario Under Consideration

Analyze the time complexity of the following Spark join with skew handling.


val skewedKeys = Seq("key1", "key2")
val largeDF = spark.read.parquet("large_data")
val smallDF = spark.read.parquet("small_data")

val saltedLarge = largeDF.withColumn("salt", when(col("key").isin(skewedKeys: _*), (rand() * 10).cast("int")).otherwise(lit(0)))
val saltedSmall = smallDF.withColumn("salt", when(col("key").isin(skewedKeys: _*), explode(array((0 to 9).map(lit): _*))).otherwise(lit(0)))

val joined = saltedLarge.join(saltedSmall, Seq("key", "salt"))
    

This code splits skewed keys into smaller parts to balance the join work.

Identify Repeating Operations

Look at what repeats when joining skewed keys.

  • Primary operation: Joining data on keys, with extra splitting (salting) for skewed keys.
  • How many times: For skewed keys, join happens multiple times (here 10 times) due to salt values.
How Execution Grows With Input

Without salting, skewed keys cause one big join task that grows very large.

Input Size (n)Approx. Operations
10 skewed rows1 big join task
100 skewed rows1 very big join task
1000 skewed rows1 huge join task causing delay

With salting, the big join splits into smaller parts, each smaller and faster.

Final Time Complexity

Time Complexity: O(n)

This means the join time grows roughly in direct proportion to the total data size, but salting helps avoid slowdowns from skew.

Common Mistake

[X] Wrong: "Adding salting makes the join slower because it repeats work multiple times."

[OK] Correct: Salting splits big skewed tasks into smaller ones that run in parallel, so overall join finishes faster despite some repeated work.

Interview Connect

Understanding how skew affects joins and how salting helps shows you can handle real data challenges and keep Spark jobs running smoothly.

Self-Check

"What if we increased the number of salt splits from 10 to 100? How would the time complexity change?"