
Cross joins and when to avoid them in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
🎖️
Cross Join Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
Predict Output
intermediate
2:00 remaining
Output of a simple cross join in Spark
What is the output of the following Spark code snippet?
Apache Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data1 = [(1, "A"), (2, "B")]
data2 = [("X", 10), ("Y", 20)]
df1 = spark.createDataFrame(data1, ["id", "val1"])
df2 = spark.createDataFrame(data2, ["val2", "num"])
cross_joined = df1.crossJoin(df2)
print(cross_joined.collect())
A. [Row(id=1, val1='A', val2='X', num=10), Row(id=1, val1='A', val2='Y', num=20), Row(id=2, val1='B', val2='X', num=10), Row(id=2, val1='B', val2='Y', num=20)]
B. [Row(id=1, val1='A', val2='X', num=10), Row(id=2, val1='B', val2='Y', num=20)]
C. [Row(id=1, val1='A', val2='X', num=10), Row(id=1, val1='A', val2='Y', num=20)]
D. SyntaxError
Attempts: 2 left
💡 Hint
Remember that a cross join pairs every row of the first DataFrame with every row of the second DataFrame.
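The pairing rule the hint describes can be checked without Spark: Python's `itertools.product` enumerates the same combinations a cross join produces. The tuples below mirror the quiz data; this is an illustrative sketch, not actual Spark output.

```python
from itertools import product

# Stand-ins for the two DataFrames' rows from the quiz snippet
data1 = [(1, "A"), (2, "B")]    # (id, val1)
data2 = [("X", 10), ("Y", 20)]  # (val2, num)

# A cross join pairs every row of the left input with every row of the right
cross = [left + right for left, right in product(data1, data2)]

for row in cross:
    print(row)
# 2 rows x 2 rows -> 4 combined rows
```

Counting the pairs (2 × 2 = 4) matches the full enumeration in option A.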
🧠 Conceptual
intermediate
1:30 remaining
When to avoid cross joins in Spark
Which of the following is the best reason to avoid cross joins in Spark?
A. Cross joins are deprecated and not supported in Spark.
B. Cross joins always produce empty results, so they are useless.
C. Cross joins automatically filter rows, which can cause data loss.
D. Cross joins can produce very large datasets that consume excessive memory and slow down processing.
Attempts: 2 left
💡 Hint
Think about what happens when you combine every row with every other row.
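The hint's point about combining every row with every other row is a multiplicative blow-up: a cross join emits rows(left) × rows(right) output rows. A quick sketch with illustrative row counts (not taken from the quiz) shows how fast that grows:

```python
# Cross join output size is the product of the input sizes.
# These row counts are illustrative, not from the quiz.
for n_left, n_right in [(1_000, 1_000), (100_000, 100_000), (1_000_000, 1_000_000)]:
    print(f"{n_left:,} x {n_right:,} -> {n_left * n_right:,} output rows")
```

Two million-row inputs already yield a trillion output rows, which is why option D is the practical concern.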
🔧 Debug
advanced
2:00 remaining
Identify the error in cross join usage
What error will this Spark code raise?
Apache Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data1 = [(1, "A")]
data2 = [("X", 10)]
df1 = spark.createDataFrame(data1, ["id", "val1"])
df2 = spark.createDataFrame(data2, ["val2", "num"])
result = df1.join(df2)
result.show()
A. No error, outputs the cross join result
B. AnalysisException: 'Detected cartesian product for INNER join, use CROSS JOIN if intended.'
C. TypeError: join() missing required argument
D. SyntaxError
Attempts: 2 left
💡 Hint
Look at how join() is called: there is no join condition.
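Note that the AnalysisException reflects Spark 2.x defaults, where implicit Cartesian products are rejected; Spark 3.x permits them by default (controlled by spark.sql.crossJoinEnabled). Either way, the idiomatic fix is to state the intent explicitly. A hedged sketch of both fixes, assuming a live SparkSession named `spark` and the `df1`/`df2` from the snippet:

```python
# Option 1: state the intent explicitly -- works on all Spark versions
result = df1.crossJoin(df2)

# Option 2 (Spark 2.x): allow implicit cartesian products globally
spark.conf.set("spark.sql.crossJoinEnabled", "true")
result = df1.join(df2)  # now runs as a cross join instead of raising
```

Preferring crossJoin() over the config flag keeps the Cartesian product visible at the call site.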
📊 Data Output
advanced
1:00 remaining
Number of rows after cross join
Given two DataFrames df1 with 3 rows and df2 with 4 rows, how many rows will the DataFrame have after a cross join?
A. 12
B. 1
C. 7
D. 0
Attempts: 2 left
💡 Hint
Multiply the number of rows in each DataFrame.
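The hint's arithmetic can be verified directly; a trivial check mirroring the row counts in the question:

```python
# Cross join row count is the product of the inputs' row counts
rows_df1, rows_df2 = 3, 4
result_rows = rows_df1 * rows_df2
print(result_rows)  # 12
```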
🚀 Application
expert
2:30 remaining
Avoiding cross join explosion in Spark
You have two large DataFrames and need to join them without a common key. Which approach best avoids the performance problems of a cross join?
A. Use join() without a condition and let Spark optimize automatically.
B. Use crossJoin() directly to get all combinations.
C. Use a broadcast join to send the smaller DataFrame to all nodes and join on a condition.
D. Convert both DataFrames to Pandas and join locally.
Attempts: 2 left
💡 Hint
Think about how to reduce data shuffle and avoid Cartesian explosion.
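The idea behind the broadcast approach (option C) can be sketched in plain Python: copy the small table to every worker as a hash map, then stream the large table against it instead of materializing all pairs. In PySpark the equivalent is wrapping the small DataFrame in pyspark.sql.functions.broadcast() and joining on a condition. The data below is illustrative, not from the quiz:

```python
# Broadcast-hash-join idea, no Spark required: the small side is turned
# into a lookup dict (the "broadcast" copy every worker would receive),
# and the large side is streamed row by row.
small = [("X", 10), ("Y", 20)]                          # (key, num) - fits in memory
large = [(1, "A", "X"), (2, "B", "Y"), (3, "C", "X")]   # (id, val, key)

lookup = dict(small)

# Only matching pairs are produced -- output stays O(rows(large)),
# unlike a cross join's O(rows(large) * rows(small))
joined = [(i, v, k, lookup[k]) for i, v, k in large if k in lookup]
print(joined)
```

Because each large-side row yields at most one output row per matching key, there is no Cartesian explosion and no shuffle of the large table.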