Challenge - 5 Problems
Multi-column Join Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of multi-column join with inner join type
Given two Spark DataFrames df1 and df2 joined on columns id and category using an inner join, what is the output of the following code?

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data1 = [(1, 'A', 100), (2, 'B', 200), (3, 'A', 300)]
data2 = [(1, 'A', 'X'), (2, 'B', 'Y'), (2, 'C', 'Z')]
df1 = spark.createDataFrame(data1, ['id', 'category', 'value1'])
df2 = spark.createDataFrame(data2, ['id', 'category', 'value2'])

joined_df = df1.join(df2, on=['id', 'category'], how='inner')
result = joined_df.collect()
💡 Hint
Inner join returns only rows with matching keys in both DataFrames.
✗ Incorrect
An inner join on columns 'id' and 'category' returns only rows where both DataFrames have the same values in both columns. The key pairs (1, 'A') and (2, 'B') appear in both DataFrames, so the result has 2 rows. The pair (2, 'C') exists only in df2 and (3, 'A') only in df1, so both are excluded.
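The matching logic can be sketched in plain Python, without a Spark session. This is an illustration of the inner-join semantics, not Spark itself; the data and key columns mirror the question, while the variable names (lookup, inner) are my own.

```python
# Rows from the question: (id, category, value1) and (id, category, value2)
data1 = [(1, 'A', 100), (2, 'B', 200), (3, 'A', 300)]
data2 = [(1, 'A', 'X'), (2, 'B', 'Y'), (2, 'C', 'Z')]

# Index df2 by its join keys (unique here, so a dict suffices)
lookup = {(i, c): v for i, c, v in data2}

# Keep only df1 rows whose (id, category) pair also exists in df2
inner = [(i, c, v1, lookup[(i, c)]) for i, c, v1 in data1 if (i, c) in lookup]
print(inner)  # [(1, 'A', 100, 'X'), (2, 'B', 200, 'Y')]
```

Only the two rows with keys present on both sides survive, matching the 2-row result of the Spark inner join.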
❓ Data Output
Intermediate · 1:30 remaining
Number of rows after left join on multiple columns
If you perform a left join of df1 with df2 on columns id and category, how many rows will the resulting DataFrame have?

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data1 = [(1, 'A', 100), (2, 'B', 200), (3, 'A', 300)]
data2 = [(1, 'A', 'X'), (2, 'B', 'Y'), (2, 'C', 'Z')]
df1 = spark.createDataFrame(data1, ['id', 'category', 'value1'])
df2 = spark.createDataFrame(data2, ['id', 'category', 'value2'])

joined_df = df1.join(df2, on=['id', 'category'], how='left')
count = joined_df.count()
💡 Hint
Left join keeps all rows from the left DataFrame.
✗ Incorrect
A left join keeps all 3 rows from df1 and adds the matching columns from df2, or nulls where no match exists. Here (1, 'A') and (2, 'B') match, while (3, 'A') has no match in df2 and gets null for value2. Because no key in df1 matches more than one row in df2, no rows are duplicated, so the total stays 3.
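The left-join behavior can also be sketched in plain Python. This is an illustration of the semantics rather than Spark itself, and it relies on the join keys in data2 being unique (true for this question's data); the variable names are my own.

```python
# Rows from the question: (id, category, value1) and (id, category, value2)
data1 = [(1, 'A', 100), (2, 'B', 200), (3, 'A', 300)]
data2 = [(1, 'A', 'X'), (2, 'B', 'Y'), (2, 'C', 'Z')]

# Index df2 by its join keys
lookup = {(i, c): v for i, c, v in data2}

# Keep EVERY df1 row; unmatched keys get None (Spark would show null)
left = [(i, c, v1, lookup.get((i, c))) for i, c, v1 in data1]
print(len(left))   # 3
print(left[2])     # (3, 'A', 300, None) — no match in data2
```

All three left-side rows survive, which is why joined_df.count() returns 3.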
🔧 Debug
Advanced · 1:30 remaining
Identify the error in multi-column join code
What error will this Spark code raise when trying to join two DataFrames on multiple columns?
df1.join(df2, on='id,category', how='inner')
💡 Hint
The 'on' parameter expects a list of column names for multiple columns.
✗ Incorrect
Passing a single comma-separated string is not valid: Spark treats 'id,category' as one column name, which exists in neither DataFrame, so the join raises an AnalysisException. The fix is to pass a list of column names: df1.join(df2, on=['id', 'category'], how='inner').
🚀 Application
Advanced · 1:30 remaining
Choosing join type for multi-column join scenario
You have two DataFrames: sales with columns store_id, date, and sales_amount, and targets with columns store_id, date, and target_amount. You want to keep all sales records and add target amounts where available. Which join type should you use on columns store_id and date?
💡 Hint
You want to keep all sales records even if no target exists.
✗ Incorrect
A left join keeps every row from the left DataFrame (sales) and adds the matching target_amount from targets, or null where no target exists for that store_id and date.
🧠 Conceptual
Expert · 1:30 remaining
Effect of join keys order in multi-column join
In a Spark DataFrame join, does the order of columns in the on list (e.g., ['id', 'category'] vs. ['category', 'id']) affect the join result?
💡 Hint
Join keys are matched as a set; the order of the names does not matter for equality.
✗ Incorrect
Spark matches rows where every join key is equal, regardless of the order of names in the on list, so the same rows are joined either way. The listed order can affect how the join columns are ordered in the output schema, but not which rows match.
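This can be checked with a plain-Python sketch using the data from the earlier questions. It illustrates the set-based matching, not Spark itself; the variable names are my own.

```python
# Rows from the earlier questions: (id, category, value)
data1 = [(1, 'A', 100), (2, 'B', 200), (3, 'A', 300)]
data2 = [(1, 'A', 'X'), (2, 'B', 'Y'), (2, 'C', 'Z')]

# Matched key pairs with keys ordered (id, category)
keys_id_cat = {(i, c) for i, c, _ in data1} & {(i, c) for i, c, _ in data2}
# Matched key pairs with keys ordered (category, id)
keys_cat_id = {(c, i) for i, c, _ in data1} & {(c, i) for i, c, _ in data2}

# Same pairs are matched either way, just with the components swapped
assert {(i, c) for c, i in keys_cat_id} == keys_id_cat
print(sorted(keys_id_cat))  # [(1, 'A'), (2, 'B')]
```

Both orderings match exactly the keys (1, 'A') and (2, 'B'), so the join produces the same rows.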