Challenge - 5 Problems
Cross Join Mastery
❓ Predict Output
intermediate
Output of a simple cross join in Spark
What is the output of the following Spark code snippet?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data1 = [(1, "A"), (2, "B")]
data2 = [("X", 10), ("Y", 20)]
df1 = spark.createDataFrame(data1, ["id", "val1"])
df2 = spark.createDataFrame(data2, ["val2", "num"])

cross_joined = df1.crossJoin(df2)
cross_joined.show()
💡 Hint
Remember that a cross join pairs every row of the first DataFrame with every row of the second DataFrame.
Explanation
A cross join returns the Cartesian product of the two DataFrames. Since df1 and df2 each have 2 rows, the result has 2 * 2 = 4 rows, one for every pairing of a df1 row with a df2 row, with the columns id, val1, val2, and num.
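The pairing described above can be checked without a Spark session. The following is a minimal plain-Python sketch that models the cross join with itertools.product; it is an illustration of the Cartesian product, not how Spark executes the join, and it uses the same sample rows as the snippet in the question.

```python
from itertools import product

# Same sample rows as the Spark snippet in the question.
data1 = [(1, "A"), (2, "B")]      # columns: id, val1
data2 = [("X", 10), ("Y", 20)]    # columns: val2, num

# Pair every row of data1 with every row of data2 and concatenate the tuples,
# which mirrors the column layout id, val1, val2, num.
cross_rows = [left + right for left, right in product(data1, data2)]

for row in cross_rows:
    print(row)
print(len(cross_rows), "rows")
```

Spark does not promise a particular row order in show(), so it is safer to compare the result as a set of rows than as an ordered listing.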
🧠 Conceptual
intermediate
When to avoid cross joins in Spark
Which of the following is the best reason to avoid cross joins in Spark?
💡 Hint
Think about what happens when you combine every row with every other row.
Explanation
A cross join produces the Cartesian product, so the output has (rows in df1) * (rows in df2) rows. With large inputs the row count explodes, which causes performance and memory problems.
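To make the growth concrete, here is a small back-of-the-envelope sketch in plain Python. The helper names and the 100 bytes per row figure are assumptions chosen for illustration; real row sizes depend on the schema and on how Spark encodes the data.

```python
def cross_join_rows(left_rows: int, right_rows: int) -> int:
    """Row count of a Cartesian product: every left row pairs with every right row."""
    return left_rows * right_rows


def approx_size_gib(rows: int, bytes_per_row: int) -> float:
    """Rough uncompressed size in GiB for a given row count and row width."""
    return rows * bytes_per_row / 2**30


BYTES_PER_ROW = 100  # assumed average row width, for illustration only

for n in (1_000, 100_000, 1_000_000):
    rows = cross_join_rows(n, n)
    size = approx_size_gib(rows, BYTES_PER_ROW)
    print(f"{n:>9,} x {n:>9,} -> {rows:>17,} rows, ~{size:,.1f} GiB")
```

The inputs grow linearly while the output grows with their product, which is why two tables that each fit comfortably in memory can still yield a result that does not.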
🔧 Debug
advanced
Identify the error in cross join usage
What error will this Spark code raise?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data1 = [(1, "A")]
data2 = [("X", 10)]
df1 = spark.createDataFrame(data1, ["id", "val1"])
df2 = spark.createDataFrame(data2, ["val2", "num"])

result = df1.join(df2)
result.show()
💡 Hint
Check how join() behaves when it is called without a join condition.
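Explanation
Whether this code raises an error depends on the Spark version. On Spark 2.x with default settings, join() without a condition fails with an AnalysisException about an implicit Cartesian product, unless spark.sql.crossJoin.enabled is set to true. Since Spark 3.0 that setting defaults to true, so the same code runs and returns the Cartesian product. Calling crossJoin() states the intent explicitly and works in both cases.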
❓ Data Output
advanced
Number of rows after cross join
Given two DataFrames df1 with 3 rows and df2 with 4 rows, how many rows will the DataFrame have after a cross join?
💡 Hint
Multiply the number of rows in each DataFrame.
Explanation
A cross join produces the Cartesian product, so total rows = rows in df1 * rows in df2 = 3 * 4 = 12.
🚀 Application
expert
Avoiding cross join explosion in Spark
You have two large DataFrames and need to join them without a common key. Which approach best avoids the performance problems of a cross join?
💡 Hint
Think about how to reduce data shuffle and avoid Cartesian explosion.
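Explanation
A broadcast join sends the smaller DataFrame to every executor, so the larger DataFrame does not have to be shuffled. This only helps when one side is small enough to fit in executor memory, and broadcasting does not by itself reduce the number of output rows: to avoid a Cartesian explosion, the join still needs a condition or a prior filter that limits which pairs match.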