Challenge - 5 Problems
Spark Join Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of Inner Join in Spark
What is the output of the following Spark code performing an inner join?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data1 = [(1, "Alice"), (2, "Bob"), (3, "Charlie")]
data2 = [(2, "Sales"), (3, "HR"), (4, "IT")]

df1 = spark.createDataFrame(data1, ["id", "name"])
df2 = spark.createDataFrame(data2, ["id", "dept"])

joined_df = df1.join(df2, on="id", how="inner")
result = joined_df.collect()
Attempts: 2 left
💡 Hint
Inner join returns only rows with matching keys in both tables.
✗ Incorrect
An inner join keeps only rows whose 'id' exists in both DataFrames. Here, ids 2 and 3 are common, so the result contains Row(id=2, name='Bob', dept='Sales') and Row(id=3, name='Charlie', dept='HR').
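The inner-join semantics can be sketched in plain Python (no Spark needed), using hypothetical dicts that mirror data1 and data2 above; note that Spark does not guarantee row order, so the sketch sorts by id for readability:

```python
# Plain-Python sketch of the inner join above: only ids present in BOTH sides survive.
data1 = {1: "Alice", 2: "Bob", 3: "Charlie"}
data2 = {2: "Sales", 3: "HR", 4: "IT"}

# Intersect the key sets, then pull matching values from each side.
inner = [(i, data1[i], data2[i]) for i in sorted(data1.keys() & data2.keys())]
print(inner)  # [(2, 'Bob', 'Sales'), (3, 'Charlie', 'HR')]
```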
❓ Predict Output
Intermediate · 2:00 remaining
Output of Left Outer Join in Spark
What is the output of this Spark code performing a left outer join?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data1 = [(1, "Alice"), (2, "Bob"), (3, "Charlie")]
data2 = [(2, "Sales"), (3, "HR"), (4, "IT")]

df1 = spark.createDataFrame(data1, ["id", "name"])
df2 = spark.createDataFrame(data2, ["id", "dept"])

joined_df = df1.join(df2, on="id", how="left")
result = joined_df.collect()
Attempts: 2 left
💡 Hint
Left join keeps all rows from the left table, adding matching data from the right.
✗ Incorrect
A left outer join keeps all rows from df1. For id=1 no matching dept exists, so dept is None: the result contains Row(id=1, name='Alice', dept=None) alongside the matched rows for ids 2 and 3.
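The left-join semantics can likewise be sketched in plain Python with hypothetical dicts standing in for the two DataFrames (row order sorted by id for readability; Spark itself makes no ordering guarantee):

```python
# Plain-Python sketch of the left outer join above: every left id is kept,
# and the right-side value falls back to None when there is no match.
data1 = {1: "Alice", 2: "Bob", 3: "Charlie"}
data2 = {2: "Sales", 3: "HR", 4: "IT"}

left = [(i, data1[i], data2.get(i)) for i in sorted(data1)]
print(left)  # [(1, 'Alice', None), (2, 'Bob', 'Sales'), (3, 'Charlie', 'HR')]
```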
❓ Predict Output
Advanced · 2:00 remaining
Output of Right Outer Join in Spark
What is the output of this Spark code performing a right outer join?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data1 = [(1, "Alice"), (2, "Bob"), (3, "Charlie")]
data2 = [(2, "Sales"), (3, "HR"), (4, "IT")]

df1 = spark.createDataFrame(data1, ["id", "name"])
df2 = spark.createDataFrame(data2, ["id", "dept"])

joined_df = df1.join(df2, on="id", how="right")
result = joined_df.collect()
Attempts: 2 left
💡 Hint
Right join keeps all rows from the right table, adding matching data from the left.
✗ Incorrect
A right outer join keeps all rows from df2. For id=4 no matching name exists, so name is None: the result contains Row(id=4, name=None, dept='IT') alongside the matched rows for ids 2 and 3.
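The right join is the mirror image of the left join; a plain-Python sketch with the same hypothetical dicts (sorted by id for readability):

```python
# Plain-Python sketch of the right outer join above: every right id is kept,
# and the left-side value falls back to None when there is no match.
data1 = {1: "Alice", 2: "Bob", 3: "Charlie"}
data2 = {2: "Sales", 3: "HR", 4: "IT"}

right = [(i, data1.get(i), data2[i]) for i in sorted(data2)]
print(right)  # [(2, 'Bob', 'Sales'), (3, 'Charlie', 'HR'), (4, None, 'IT')]
```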
❓ Predict Output
Advanced · 2:00 remaining
Output of Full Outer Join in Spark
What is the output of this Spark code performing a full outer join?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data1 = [(1, "Alice"), (2, "Bob"), (3, "Charlie")]
data2 = [(2, "Sales"), (3, "HR"), (4, "IT")]

df1 = spark.createDataFrame(data1, ["id", "name"])
df2 = spark.createDataFrame(data2, ["id", "dept"])

joined_df = df1.join(df2, on="id", how="outer")
result = joined_df.collect()
Attempts: 2 left
💡 Hint
Full outer join keeps all rows from both tables, filling missing values with None.
✗ Incorrect
A full outer join returns every id from either DataFrame (1, 2, 3, and 4). Where one side has no match, its columns are None: name is None for id=4 and dept is None for id=1.
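The full-outer-join semantics combine both fallbacks; a plain-Python sketch with the same hypothetical dicts (sorted by id; Spark makes no ordering guarantee):

```python
# Plain-Python sketch of the full outer join above: the union of both key sets
# is kept, with None wherever one side has no match.
data1 = {1: "Alice", 2: "Bob", 3: "Charlie"}
data2 = {2: "Sales", 3: "HR", 4: "IT"}

outer = [(i, data1.get(i), data2.get(i)) for i in sorted(data1.keys() | data2.keys())]
print(outer)
# [(1, 'Alice', None), (2, 'Bob', 'Sales'), (3, 'Charlie', 'HR'), (4, None, 'IT')]
```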
🧠 Conceptual
Expert · 1:30 remaining
Understanding Join Types in Spark
Which join type would you use in Spark to keep all rows from the left dataframe and only matching rows from the right dataframe, filling unmatched right columns with nulls?
Attempts: 2 left
💡 Hint
Think about which join keeps all left rows regardless of matches.
✗ Incorrect
A left outer join (how="left") keeps all rows from the left DataFrame and fills unmatched right-side columns with nulls.