Challenge - 5 Problems
Self Join Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of a self join with aliasing
What is the output of the following Apache Spark code snippet that performs a self join on a DataFrame of employees to find pairs of employees working in the same department?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

data = [(1, "Alice", "HR"), (2, "Bob", "HR"), (3, "Charlie", "IT"),
        (4, "David", "IT"), (5, "Eve", "Finance")]
columns = ["id", "name", "dept"]
df = spark.createDataFrame(data, columns)

df1 = df.alias("df1")
df2 = df.alias("df2")
joined = df1.join(
    df2,
    (col("df1.dept") == col("df2.dept")) & (col("df1.id") < col("df2.id")),
    "inner",
)
result = joined.select(col("df1.name").alias("emp1"), col("df2.name").alias("emp2"))
result.show()
💡 Hint
Look at the join condition carefully, especially the part with id comparison.
✅ Explanation
The join matches employees in the same department but keeps only pairs where df1.id < df2.id, which rules out self-pairs and duplicate orderings. So Alice-Bob and Charlie-David each appear exactly once.
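As a sanity check, the pairing logic can be reproduced in plain Python without Spark; this sketch mirrors the DataFrame from the question and uses itertools.combinations, which emits each unordered pair once, exactly what the (df1.id < df2.id) condition enforces:

```python
from itertools import combinations

# Mirrors the employees DataFrame in the question.
employees = [(1, "Alice", "HR"), (2, "Bob", "HR"), (3, "Charlie", "IT"),
             (4, "David", "IT"), (5, "Eve", "Finance")]

# Group names by department, preserving id order.
by_dept = {}
for _id, name, dept in employees:
    by_dept.setdefault(dept, []).append(name)

# Take unordered pairs within each department; a one-person department
# (Finance) contributes nothing.
pairs = [pair for names in by_dept.values() for pair in combinations(names, 2)]
print(pairs)  # [('Alice', 'Bob'), ('Charlie', 'David')]
```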
❓ Data Output
Intermediate · 1:30 remaining
Number of rows after self join with filter
Given a DataFrame of products with columns (product_id, category), what is the number of rows in the result after performing a self join on category with the condition product_id from left < product_id from right?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

data = [(101, "A"), (102, "A"), (103, "B"), (104, "B"), (105, "B"), (106, "C")]
columns = ["product_id", "category"]
df = spark.createDataFrame(data, columns)

df1 = df.alias("df1")
df2 = df.alias("df2")
joined = df1.join(
    df2,
    (col("df1.category") == col("df2.category"))
    & (col("df1.product_id") < col("df2.product_id")),
    "inner",
)
count = joined.count()
print(count)
💡 Hint
Count pairs per category where left product_id is less than right product_id.
✅ Explanation
Category A has 2 products → 1 pair (101,102). Category B has 3 products → 3 pairs (103,104), (103,105), (104,105). Category C has 1 product → 0 pairs. Total pairs = 1 + 3 + 0 = 4.
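Each category with n products contributes C(n, 2) = n*(n-1)/2 pairs, since the product_id ordering keeps one ordering per pair. The arithmetic can be checked in plain Python:

```python
from collections import Counter
from math import comb

# Mirrors the products DataFrame in the question.
products = [(101, "A"), (102, "A"), (103, "B"), (104, "B"), (105, "B"), (106, "C")]

# Count products per category, then sum "n choose 2" over the categories.
sizes = Counter(category for _pid, category in products)
total = sum(comb(n, 2) for n in sizes.values())
print(total)  # C(2,2 pairs)=1 for A, 3 for B, 0 for C -> 4
```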
🔧 Debug
Advanced · 1:30 remaining
Identify the error in self join code
What error does the following Apache Spark code raise when trying to perform a self join on a DataFrame of users to find pairs with the same city?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

data = [(1, "John", "NY"), (2, "Jane", "LA"), (3, "Doe", "NY")]
columns = ["user_id", "name", "city"]
df = spark.createDataFrame(data, columns)

joined = df.join(df, df.city == df.city, "inner")
joined.show()
💡 Hint
Think about what happens when you join a DataFrame with itself without aliasing.
✅ Explanation
Joining a DataFrame with itself without aliasing causes ambiguity because Spark cannot distinguish which 'city' column is from which side. This raises an AnalysisException about ambiguous references.
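To see why the condition itself is the problem, a plain-Python analogue helps: df.city == df.city compares a column with itself, so the predicate is trivially true and the join degenerates into a full cross product, whereas comparing the left side against a distinct right side (what aliasing gives Spark) yields the intended same-city pairs. The users list below mirrors the question:

```python
# Mirrors the users DataFrame in the question; index 2 is the city.
users = [(1, "John", "NY"), (2, "Jane", "LA"), (3, "Doe", "NY")]

# Comparing the left city with itself is always true: every row pair "matches".
trivial = [(left, right) for left in users for right in users if left[2] == left[2]]

# Comparing left against right keeps only genuine same-city pairs.
same_city = [(left, right) for left in users for right in users if left[2] == right[2]]

print(len(trivial), len(same_city))  # 9 5
```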
🚀 Application
Advanced · 2:00 remaining
Use self join to find manager-subordinate pairs
You have an employee DataFrame with columns (emp_id, name, manager_id). How would you use a self join to find each employee's manager's name?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

data = [(1, "Alice", None), (2, "Bob", 1), (3, "Charlie", 1), (4, "David", 2)]
columns = ["emp_id", "name", "manager_id"]
df = spark.createDataFrame(data, columns)

df_emp = df.alias("emp")
df_mgr = df.alias("mgr")
joined = df_emp.join(df_mgr, col("emp.manager_id") == col("mgr.emp_id"), "left")
result = joined.select(col("emp.name").alias("employee"), col("mgr.name").alias("manager"))
result.show()
💡 Hint
Match employee's manager_id with manager's emp_id using aliases.
✅ Explanation
The self join matches each employee's manager_id with the emp_id of the manager row. Because it is a left join, Alice is kept even though her manager_id is null, so her manager column is null; Bob's and Charlie's manager is Alice, and David's manager is Bob.
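The left-join semantics can be sketched in plain Python: index the "manager side" by emp_id, then look each employee's manager_id up in that index, with a missing match producing None, just as the left join produces a null manager for Alice:

```python
# Mirrors the employee DataFrame in the question.
employees = [(1, "Alice", None), (2, "Bob", 1), (3, "Charlie", 1), (4, "David", 2)]

# Index the manager side by emp_id, as the "mgr" alias does in the join.
by_id = {emp_id: name for emp_id, name, _mgr_id in employees}

# dict.get returns None when there is no matching manager, mimicking the
# null manager a left join produces for Alice.
result = [(name, by_id.get(mgr_id)) for _eid, name, mgr_id in employees]
print(result)
# [('Alice', None), ('Bob', 'Alice'), ('Charlie', 'Alice'), ('David', 'Bob')]
```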
🧠 Conceptual
Expert · 1:30 remaining
Why use self join with condition on IDs?
Why do we often add a condition like df1.id < df2.id when performing a self join to find pairs of related rows?
💡 Hint
Think about how pairs like (1,2) and (2,1) relate and how to avoid counting both.
✅ Explanation
The condition df1.id < df2.id ensures that for each pair, only one ordering is included, preventing duplicates like (1,2) and (2,1) and also excluding self-pairs where IDs are equal.
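A small illustration of the effect, using plain Python over a toy id list: without the ordering condition a self join yields every ordered pair, including self-pairs and both orderings of each pair; the a < b filter keeps exactly one ordering and no self-pairs:

```python
from itertools import product

ids = [1, 2, 3]

# A plain self join: every ordered pair, including (1, 1) and both (1, 2)/(2, 1).
all_pairs = [(a, b) for a, b in product(ids, ids)]

# The a < b condition keeps one ordering per pair and drops self-pairs.
deduped = [(a, b) for a, b in product(ids, ids) if a < b]

print(len(all_pairs))  # 9
print(deduped)         # [(1, 2), (1, 3), (2, 3)]
```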