Challenge - 5 Problems
Skewed Join Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of a skewed join without optimization
Given two Spark DataFrames df1 and df2, where df2 has a highly skewed key, what will be the output count of the join df1.join(df2, "key")?

Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data1 = [(1, 'A'), (2, 'B'), (3, 'C'), (4, 'D')]
data2 = [(1, 'X')] * 1000 + [(2, 'Y'), (3, 'Z'), (4, 'W')]

df1 = spark.createDataFrame(data1, ['key', 'val1'])
df2 = spark.createDataFrame(data2, ['key', 'val2'])

joined = df1.join(df2, 'key')
print(joined.count())
Attempts:
2 left
💡 Hint
Think about how many rows match for each key in the join.
✗ Incorrect
The join matches each row in df1 with all matching rows in df2 by key. Key=1 in df2 has 1000 rows, so df1's key=1 row joins with 1000 rows. Other keys have 1 row each. Total is 1000 + 1 + 1 + 1 = 1003.
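The 1003 figure can be checked without Spark: inner-join cardinality per key is (rows in df1) × (rows in df2). A minimal plain-Python sketch over the same data:

```python
from collections import Counter

# Same data as the Spark snippet above
data1 = [(1, 'A'), (2, 'B'), (3, 'C'), (4, 'D')]
data2 = [(1, 'X')] * 1000 + [(2, 'Y'), (3, 'Z'), (4, 'W')]

# An inner join matches each df1 row with every df2 row sharing its key
df2_counts = Counter(key for key, _ in data2)  # {1: 1000, 2: 1, 3: 1, 4: 1}
join_count = sum(df2_counts[key] for key, _ in data1)
print(join_count)  # 1003
```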
🧠 Conceptual
Intermediate · 1:30 remaining
Reason for skewed join performance issues
Why do skewed joins cause performance problems in distributed data processing frameworks like Spark?
Attempts:
2 left
💡 Hint
Consider how data is distributed across tasks.
✗ Incorrect
Skewed keys cause some tasks to handle a large portion of data, making those tasks slower and causing overall job delay.
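A rough plain-Python sketch of that imbalance (Python's built-in hash stands in for Spark's partitioner here, and the 4-task count is an illustrative assumption):

```python
from collections import Counter

# Rows from the first problem: key=1 is hot
rows = [(1, 'X')] * 1000 + [(2, 'Y'), (3, 'Z'), (4, 'W')]
num_tasks = 4

# Hash-partitioning by key sends every row with the same key to the same task
rows_per_task = Counter(hash(key) % num_tasks for key, _ in rows)
print(sorted(rows_per_task.values(), reverse=True))  # [1000, 1, 1, 1]
```

One task processes 1000 rows while the others process 1 each, so the stage finishes only when the overloaded task does.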
🚀 Application
Advanced · 2:30 remaining
Choosing a technique to handle skewed join
You have two large DataFrames with a highly skewed join key. Which technique is best to reduce the impact of skew on join performance?
Attempts:
2 left
💡 Hint
Think about how to distribute skewed keys evenly.
✗ Incorrect
Salting adds randomness to the skewed key, spreading its data across partitions and reducing task imbalance.
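A minimal plain-Python sketch of the idea (10 salts is an arbitrary choice; in a real join, the other side must be replicated once per salt value so salted keys still find their match):

```python
import random

random.seed(0)  # deterministic for illustration

rows = [(1, 'X')] * 1000  # one hot key
num_salts = 10

# Appending a random salt fans the hot key out into up to 10 distinct keys,
# so its rows hash to several partitions instead of one
salted = [(f"{key}_{random.randrange(num_salts)}", val) for key, val in rows]

distinct_keys = {k for k, _ in salted}
print(len(distinct_keys))  # 10
```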
🔧 Debug
Advanced · 2:00 remaining
Identify the error in skewed join salting code
What error will this PySpark code raise?
from pyspark.sql.functions import col, concat, rand

salted_df1 = df1.withColumn('salt', (rand() * 10).cast('int'))
salted_df1 = salted_df1.withColumn('salted_key', concat(col('key'), col('salt')))
salted_df2 = df2.withColumn('salt', (rand() * 10).cast('int'))
salted_df2 = salted_df2.withColumn('salted_key', concat(col('key'), col('salt')))
joined = salted_df1.join(salted_df2, 'salted_key')
Attempts:
2 left
💡 Hint
Check the data types used in concat function.
✗ Incorrect
Spark's concat requires string (or binary) inputs. Here 'key' and 'salt' are integer columns, so the analyzer raises an AnalysisException (data type mismatch); both columns must be cast to string before being passed to concat.
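A plain-Python analogue of the same pitfall (not Spark code): concatenating an un-cast integer onto a string fails, and casting both sides fixes it, mirroring concat(col('key').cast('string'), col('salt').cast('string')).

```python
key, salt = 1, 7

try:
    bad = str(key) + salt        # raises TypeError: can't concatenate int to str
except TypeError:
    bad = None

good = str(key) + str(salt)      # cast both sides to string first
print(good)  # '17'
```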
❓ Data Output
Expert · 3:00 remaining
Result count after salting skewed join
Given df1 with keys [1, 2, 3] and df2 where key=1 has 100 rows and keys 2 and 3 have 1 row each, after applying salting with 5 salt values (0 to 4) on key=1 only, what is the count of rows in the salted join?

Apache Spark

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, col

spark = SparkSession.builder.getOrCreate()

data1 = [(1, 'A'), (2, 'B'), (3, 'C')]
data2 = [(1, f'X{i}') for i in range(100)] + [(2, 'Y'), (3, 'Z')]
df1 = spark.createDataFrame(data1, ['key', 'val1'])
df2 = spark.createDataFrame(data2, ['key', 'val2'])

salt_values = [0, 1, 2, 3, 4]
salts = spark.createDataFrame([(x,) for x in salt_values], ['salt'])

# Salt df1 only for key=1; drop the helper 'salt' column so both
# branches have the same schema before the union
salted_df1 = df1.filter(col('key') != 1).withColumn('salted_key', col('key').cast('string'))
salted_ones = df1.filter(col('key') == 1).crossJoin(salts)
salted_ones = salted_ones.withColumn(
    'salted_key', concat(col('key').cast('string'), col('salt').cast('string'))).drop('salt')
salted_df1 = salted_df1.union(salted_ones)

# Salt df2 for key=1 the same way
salted_df2 = df2.filter(col('key') != 1).withColumn('salted_key', col('key').cast('string'))
salted_ones_df2 = df2.filter(col('key') == 1).crossJoin(salts)
salted_ones_df2 = salted_ones_df2.withColumn(
    'salted_key', concat(col('key').cast('string'), col('salt').cast('string'))).drop('salt')
salted_df2 = salted_df2.union(salted_ones_df2)

joined = salted_df1.join(salted_df2, 'salted_key')
print(joined.count())
Attempts:
2 left
💡 Hint
Calculate how many rows join for salted and non-salted keys.
✗ Incorrect
Key=1 in df1 is replicated 5 times (one row per salt), and each of df2's 100 key=1 rows is also replicated across all 5 salts. Each salt value therefore matches 1 × 100 = 100 pairs, giving 5 × 100 = 500 rows for key=1. Keys 2 and 3 join once each. Total: 500 + 1 + 1 = 502; the closest option is 500.
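The 502 figure can be verified in plain Python. Note that because both sides here are cross-joined with every salt value, each original key=1 pair appears once per salt (hence 5 × 100); standard salting instead assigns a random salt to the large side and replicates only the small side, which preserves the original match count.

```python
from collections import Counter

salts = range(5)

# key=1 is salted on both sides; keys 2 and 3 keep their plain key
salted_df1 = [f"1{s}" for s in salts] + ['2', '3']
salted_df2 = [f"1{s}" for _ in range(100) for s in salts] + ['2', '3']

# Inner join on salted_key: sum the df2 match count for every df1 row
df2_counts = Counter(salted_df2)
join_count = sum(df2_counts[k] for k in salted_df1)
print(join_count)  # 502
```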