Apache Spark · data · ~20 mins

Self joins in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
🎖️ Self Join Mastery
Get all challenges correct to earn this badge!
Predict Output
intermediate
Output of a self join with aliasing
What is the output of the following Apache Spark code snippet that performs a self join on a DataFrame of employees to find pairs of employees working in the same department?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
data = [(1, "Alice", "HR"), (2, "Bob", "HR"), (3, "Charlie", "IT"), (4, "David", "IT"), (5, "Eve", "Finance")]
columns = ["id", "name", "dept"]
df = spark.createDataFrame(data, columns)
df1 = df.alias("df1")
df2 = df.alias("df2")
joined = df1.join(df2, (col("df1.dept") == col("df2.dept")) & (col("df1.id") < col("df2.id")), "inner")
result = joined.select(col("df1.name").alias("emp1"), col("df2.name").alias("emp2"))
result.show()
A. [Row(emp1='Alice', emp2='Bob'), Row(emp1='Charlie', emp2='David')]
B. [Row(emp1='Alice', emp2='Bob'), Row(emp1='Bob', emp2='Alice'), Row(emp1='Charlie', emp2='David'), Row(emp1='David', emp2='Charlie')]
C. [Row(emp1='Bob', emp2='Alice'), Row(emp1='David', emp2='Charlie')]
D. [Row(emp1='Alice', emp2='Eve'), Row(emp1='Charlie', emp2='David')]
💡 Hint
Look at the join condition carefully, especially the part with id comparison.
Data Output
intermediate
Number of rows after self join with filter
Given a DataFrame of products with columns (product_id, category), how many rows does the result contain after a self join on category with the condition that the left product_id is less than the right product_id?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
data = [(101, "A"), (102, "A"), (103, "B"), (104, "B"), (105, "B"), (106, "C")]
columns = ["product_id", "category"]
df = spark.createDataFrame(data, columns)
df1 = df.alias("df1")
df2 = df.alias("df2")
joined = df1.join(df2, (col("df1.category") == col("df2.category")) & (col("df1.product_id") < col("df2.product_id")), "inner")
count = joined.count()
print(count)
A. 6
B. 4
C. 5
D. 3
💡 Hint
Count pairs per category where left product_id is less than right product_id.
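The hint's counting can be sketched in plain Python (not Spark) with `itertools.combinations`: a self join on a group key with left id < right id keeps each unordered pair exactly once, i.e. C(n, 2) = n·(n−1)/2 pairs per group. The `groups` data below is hypothetical, not the question's.

```python
from itertools import combinations

# Plain-Python sketch (not Spark) of the counting in the hint: a self join
# on a group key with left_id < right_id keeps each unordered pair exactly
# once, i.e. C(n, 2) = n * (n - 1) // 2 pairs for a group of n rows.
groups = {"X": [1, 2, 3, 4], "Y": [10, 20], "Z": [99]}  # hypothetical data
pair_counts = {
    cat: len(list(combinations(sorted(ids), 2)))
    for cat, ids in groups.items()
}
print(pair_counts)  # → {'X': 6, 'Y': 1, 'Z': 0}
```

Note that a group with a single row contributes zero pairs, since there is no second row to pair it with.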
🔧 Debug
advanced
Identify the error in self join code
What error does the following Apache Spark code raise when trying to perform a self join on a DataFrame of users to find pairs with the same city?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
data = [(1, "John", "NY"), (2, "Jane", "LA"), (3, "Doe", "NY")]
columns = ["user_id", "name", "city"]
df = spark.createDataFrame(data, columns)
joined = df.join(df, df.city == df.city, "inner")
joined.show()
A. AnalysisException: Reference 'city' is ambiguous, could be: city, city.
B. ValueError: Join condition must be a boolean expression.
C. TypeError: Column is not iterable.
D. No error, outputs all pairs of users.
💡 Hint
Think about what happens when you join a DataFrame with itself without aliasing.
🚀 Application
advanced
Use a self join to find manager-subordinate pairs
You have an employee DataFrame with columns (emp_id, name, manager_id). How would you use a self join to find each employee's manager's name?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
data = [(1, "Alice", None), (2, "Bob", 1), (3, "Charlie", 1), (4, "David", 2)]
columns = ["emp_id", "name", "manager_id"]
df = spark.createDataFrame(data, columns)
df_emp = df.alias("emp")
df_mgr = df.alias("mgr")
joined = df_emp.join(df_mgr, col("emp.manager_id") == col("mgr.emp_id"), "left")
result = joined.select(col("emp.name").alias("employee"), col("mgr.name").alias("manager"))
result.show()
A. [Row(employee='Alice', manager=None), Row(employee='Bob', manager=None), Row(employee='Charlie', manager=None), Row(employee='David', manager=None)]
B. [Row(employee='Alice', manager='Bob'), Row(employee='Bob', manager='Charlie'), Row(employee='Charlie', manager='David'), Row(employee='David', manager=None)]
C. [Row(employee='Alice', manager=None), Row(employee='Bob', manager='Alice'), Row(employee='Charlie', manager='Alice'), Row(employee='David', manager='Bob')]
D. [Row(employee='Alice', manager='Alice'), Row(employee='Bob', manager='Bob'), Row(employee='Charlie', manager='Charlie'), Row(employee='David', manager='David')]
💡 Hint
Match employee's manager_id with manager's emp_id using aliases.
🧠 Conceptual
expert
Why use a self join with a condition on IDs?
Why do we often add a condition like df1.id < df2.id when performing a self join to find pairs of related rows?
A. To join only rows with identical IDs.
B. To increase the number of pairs by including reversed duplicates.
C. To filter out all pairs where IDs are equal, but allow reversed pairs.
D. To avoid duplicate pairs and self-pairing by ensuring each pair is unique and ordered.
💡 Hint
Think about how pairs like (1,2) and (2,1) relate and how to avoid counting both.
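The effect of each candidate condition can be sketched in plain Python (not Spark) with `itertools.product`, which mimics an unrestricted self join on a shared key:

```python
from itertools import product

# Tiny sketch of the hint: what each join condition leaves behind.
ids = [1, 2, 3]
all_pairs = [(a, b) for a, b in product(ids, ids)]          # 9: full self join, incl. (1, 1)
no_self = [(a, b) for a, b in product(ids, ids) if a != b]  # 6: drops self-pairs, keeps (1,2) AND (2,1)
ordered = [(a, b) for a, b in product(ids, ids) if a < b]   # 3: each unordered pair exactly once
print(len(all_pairs), len(no_self), len(ordered))  # → 9 6 3
```

Only the `a < b` condition eliminates both self-pairs and reversed duplicates in a single comparison.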