0
0
Apache Sparkdata~20 mins

Multi-column joins in Apache Spark - Practice Problems & Coding Challenges

Choose your learning style9 modes available
Challenge - 5 Problems
🎖️
Multi-column Join Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
Predict Output
intermediate
2:00remaining
Output of multi-column join with inner join type
Given two Spark DataFrames df1 and df2 joined on columns id and category using inner join, what is the output of the following code?
Apache Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data1 = [(1, 'A', 100), (2, 'B', 200), (3, 'A', 300)]
data2 = [(1, 'A', 'X'), (2, 'B', 'Y'), (2, 'C', 'Z')]
df1 = spark.createDataFrame(data1, ['id', 'category', 'value1'])
df2 = spark.createDataFrame(data2, ['id', 'category', 'value2'])
joined_df = df1.join(df2, on=['id', 'category'], how='inner')
result = joined_df.collect()
A[Row(id=1, category='A', value1=100, value2='X'), Row(id=2, category='B', value1=200, value2='Y'), Row(id=2, category='C', value1=None, value2='Z')]
B[Row(id=1, category='A', value1=100, value2='X'), Row(id=2, category='B', value1=200, value2='Y')]
C[Row(id=1, category='A', value1=100, value2='X'), Row(id=3, category='A', value1=300, value2=None)]
D[Row(id=2, category='C', value1=None, value2='Z')]
Attempts:
2 left
💡 Hint
Inner join returns only rows with matching keys in both DataFrames.
data_output
intermediate
1:30remaining
Number of rows after left join on multiple columns
If you perform a left join of df1 with df2 on columns id and category, how many rows will the resulting DataFrame have?
Apache Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data1 = [(1, 'A', 100), (2, 'B', 200), (3, 'A', 300)]
data2 = [(1, 'A', 'X'), (2, 'B', 'Y'), (2, 'C', 'Z')]
df1 = spark.createDataFrame(data1, ['id', 'category', 'value1'])
df2 = spark.createDataFrame(data2, ['id', 'category', 'value2'])
joined_df = df1.join(df2, on=['id', 'category'], how='left')
count = joined_df.count()
A2
B4
C3
D5
Attempts:
2 left
💡 Hint
Left join keeps all rows from the left DataFrame.
🔧 Debug
advanced
1:30remaining
Identify the error in multi-column join code
What error will this Spark code raise when trying to join two DataFrames on multiple columns?
Apache Spark
df1.join(df2, on='id,category', how='inner')
AAnalysisException: cannot resolve 'id,category' given input columns
BSyntaxError: invalid syntax
CNo error, join works correctly
DTypeError: join() argument 'on' must be a string or a list of strings
Attempts:
2 left
💡 Hint
The 'on' parameter expects a list of column names for multiple columns.
🚀 Application
advanced
1:30remaining
Choosing join type for multi-column join scenario
You have two DataFrames: sales with columns store_id, date, sales_amount and targets with columns store_id, date, target_amount. You want to keep all sales records and add target amounts where available. Which join type should you use on columns store_id and date?
AFull outer join
BInner join
CRight join
DLeft join
Attempts:
2 left
💡 Hint
You want to keep all sales records even if no target exists.
🧠 Conceptual
expert
1:30remaining
Effect of join keys order in multi-column join
In Spark DataFrame join, does the order of columns in the on list (e.g., ['id', 'category'] vs ['category', 'id']) affect the join result?
ANo, the order of join keys does not affect the join result
BNo, but it affects the performance of the join
CYes, the order changes the join keys and result rows
DYes, but only for outer joins, not inner joins
Attempts:
2 left
💡 Hint
Join keys are matched as a set, order does not matter for equality.