Challenge - 5 Problems
Multi-column Join Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of multi-column join with inner join type
Given two Spark DataFrames df1 and df2 joined on columns id and category using an inner join, what is the output of the following code?

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data1 = [(1, 'A', 100), (2, 'B', 200), (3, 'A', 300)]
data2 = [(1, 'A', 'X'), (2, 'B', 'Y'), (2, 'C', 'Z')]
df1 = spark.createDataFrame(data1, ['id', 'category', 'value1'])
df2 = spark.createDataFrame(data2, ['id', 'category', 'value2'])

joined_df = df1.join(df2, on=['id', 'category'], how='inner')
result = joined_df.collect()
💡 Hint
Inner join returns only rows with matching keys in both DataFrames.
✗ Incorrect
An inner join on columns 'id' and 'category' returns only rows where both DataFrames have the same values in both columns. The key pairs (1, 'A') and (2, 'B') appear in both DataFrames, so the result has 2 rows. The pair (2, 'C') exists only in df2 and (3, 'A') only in df1, so both are excluded.
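The matching logic can be sketched in plain Python, without a Spark session. This is an illustration of the inner-join semantics, not Spark itself; the data and key columns mirror the question, while the variable names (lookup, inner) are my own.

```python
# Rows from the question: (id, category, value1) and (id, category, value2)
data1 = [(1, 'A', 100), (2, 'B', 200), (3, 'A', 300)]
data2 = [(1, 'A', 'X'), (2, 'B', 'Y'), (2, 'C', 'Z')]

# Index df2 by its join keys (unique here, so a dict suffices)
lookup = {(i, c): v for i, c, v in data2}

# Keep only df1 rows whose (id, category) pair also exists in df2
inner = [(i, c, v1, lookup[(i, c)]) for i, c, v1 in data1 if (i, c) in lookup]
print(inner)  # [(1, 'A', 100, 'X'), (2, 'B', 200, 'Y')]
```

Only the two rows with keys present on both sides survive, matching the 2-row result of the Spark inner join.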
❓ Data Output
Intermediate · 1:30 remaining
Number of rows after left join on multiple columns
If you perform a left join of df1 with df2 on columns id and category, how many rows will the resulting DataFrame have?

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data1 = [(1, 'A', 100), (2, 'B', 200), (3, 'A', 300)]
data2 = [(1, 'A', 'X'), (2, 'B', 'Y'), (2, 'C', 'Z')]
df1 = spark.createDataFrame(data1, ['id', 'category', 'value1'])
df2 = spark.createDataFrame(data2, ['id', 'category', 'value2'])

joined_df = df1.join(df2, on=['id', 'category'], how='left')
count = joined_df.count()
💡 Hint
Left join keeps all rows from the left DataFrame.
✗ Incorrect
A left join keeps all 3 rows from df1 and adds the matching columns from df2, or nulls where no match exists. Here (1, 'A') and (2, 'B') match, while (3, 'A') has no match in df2 and gets null for value2. Because no key in df1 matches more than one row in df2, no rows are duplicated, so the total stays 3.
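The left-join behavior can also be sketched in plain Python. This is an illustration of the semantics rather than Spark itself, and it relies on the join keys in data2 being unique (true for this question's data); the variable names are my own.

```python
# Rows from the question: (id, category, value1) and (id, category, value2)
data1 = [(1, 'A', 100), (2, 'B', 200), (3, 'A', 300)]
data2 = [(1, 'A', 'X'), (2, 'B', 'Y'), (2, 'C', 'Z')]

# Index df2 by its join keys
lookup = {(i, c): v for i, c, v in data2}

# Keep EVERY df1 row; unmatched keys get None (Spark would show null)
left = [(i, c, v1, lookup.get((i, c))) for i, c, v1 in data1]
print(len(left))   # 3
print(left[2])     # (3, 'A', 300, None) — no match in data2
```

All three left-side rows survive, which is why joined_df.count() returns 3.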
🔧 Debug
Advanced · 1:30 remaining
Identify the error in multi-column join code
What error will this Spark code raise when trying to join two DataFrames on multiple columns?
df1.join(df2, on='id,category', how='inner')
💡 Hint
The 'on' parameter expects a list of column names for multiple columns.
✗ Incorrect
Passing a single comma-separated string is not valid: Spark treats 'id,category' as one column name, which exists in neither DataFrame, so the join raises an AnalysisException. The fix is to pass a list of column names: df1.join(df2, on=['id', 'category'], how='inner').
🚀 Application
Advanced · 1:30 remaining
Choosing join type for multi-column join scenario
You have two DataFrames: sales with columns store_id, date, and sales_amount, and targets with columns store_id, date, and target_amount. You want to keep all sales records and add target amounts where available. Which join type should you use on columns store_id and date?
💡 Hint
You want to keep all sales records even if no target exists.
✗ Incorrect
A left join keeps every row from the left DataFrame (sales) and adds the matching target_amount from targets, or null where no target exists for that store_id and date.
🧠 Conceptual
Expert · 1:30 remaining
Effect of join keys order in multi-column join
In a Spark DataFrame join, does the order of columns in the on list (e.g., ['id', 'category'] vs. ['category', 'id']) affect the join result?
💡 Hint
Join keys are matched as a set; the order of the names does not matter for equality.
✗ Incorrect
Spark matches rows where every join key is equal, regardless of the order of names in the on list, so the same rows are joined either way. The listed order can affect how the join columns are ordered in the output schema, but not which rows match.
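This can be checked with a plain-Python sketch using the data from the earlier questions. It illustrates the set-based matching, not Spark itself; the variable names are my own.

```python
# Rows from the earlier questions: (id, category, value)
data1 = [(1, 'A', 100), (2, 'B', 200), (3, 'A', 300)]
data2 = [(1, 'A', 'X'), (2, 'B', 'Y'), (2, 'C', 'Z')]

# Matched key pairs with keys ordered (id, category)
keys_id_cat = {(i, c) for i, c, _ in data1} & {(i, c) for i, c, _ in data2}
# Matched key pairs with keys ordered (category, id)
keys_cat_id = {(c, i) for i, c, _ in data1} & {(c, i) for i, c, _ in data2}

# Same pairs are matched either way, just with the components swapped
assert {(i, c) for c, i in keys_cat_id} == keys_id_cat
print(sorted(keys_id_cat))  # [(1, 'A'), (2, 'B')]
```

Both orderings match exactly the keys (1, 'A') and (2, 'B'), so the join produces the same rows.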