Apache Spark · data · ~20 mins

Self joins in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
🎖️ Self Join Mastery
Get all challenges correct to earn this badge!
Predict Output
intermediate
Output of a self join with aliasing
What is the output of the following Apache Spark code snippet that performs a self join on a DataFrame of employees to find pairs of employees working in the same department?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
data = [(1, "Alice", "HR"), (2, "Bob", "HR"), (3, "Charlie", "IT"), (4, "David", "IT"), (5, "Eve", "Finance")]
columns = ["id", "name", "dept"]
df = spark.createDataFrame(data, columns)
df1 = df.alias("df1")
df2 = df.alias("df2")
joined = df1.join(df2, (col("df1.dept") == col("df2.dept")) & (col("df1.id") < col("df2.id")), "inner")
result = joined.select(col("df1.name").alias("emp1"), col("df2.name").alias("emp2"))
result.show()
A. [Row(emp1='Alice', emp2='Bob'), Row(emp1='Charlie', emp2='David')]
B. [Row(emp1='Alice', emp2='Bob'), Row(emp1='Bob', emp2='Alice'), Row(emp1='Charlie', emp2='David'), Row(emp1='David', emp2='Charlie')]
C. [Row(emp1='Bob', emp2='Alice'), Row(emp1='David', emp2='Charlie')]
D. [Row(emp1='Alice', emp2='Eve'), Row(emp1='Charlie', emp2='David')]
💡 Hint
Look at the join condition carefully, especially the part with id comparison.
Data Output
intermediate
Number of rows after self join with filter
Given a DataFrame of products with columns (product_id, category), how many rows does the result contain after a self join on category with the condition that the left product_id is less than the right product_id?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
data = [(101, "A"), (102, "A"), (103, "B"), (104, "B"), (105, "B"), (106, "C")]
columns = ["product_id", "category"]
df = spark.createDataFrame(data, columns)
df1 = df.alias("df1")
df2 = df.alias("df2")
joined = df1.join(df2, (col("df1.category") == col("df2.category")) & (col("df1.product_id") < col("df2.product_id")), "inner")
count = joined.count()
print(count)
A. 6
B. 4
C. 5
D. 3
💡 Hint
Count pairs per category where left product_id is less than right product_id.
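The hint's counting can be sketched in plain Python (not Spark) with `itertools.combinations`: a self join on a group key with left id < right id keeps each unordered pair exactly once, i.e. C(n, 2) = n·(n−1)/2 pairs per group. The `groups` data below is hypothetical, not the question's.

```python
from itertools import combinations

# Plain-Python sketch (not Spark) of the counting in the hint: a self join
# on a group key with left_id < right_id keeps each unordered pair exactly
# once, i.e. C(n, 2) = n * (n - 1) // 2 pairs for a group of n rows.
groups = {"X": [1, 2, 3, 4], "Y": [10, 20], "Z": [99]}  # hypothetical data
pair_counts = {
    cat: len(list(combinations(sorted(ids), 2)))
    for cat, ids in groups.items()
}
print(pair_counts)  # → {'X': 6, 'Y': 1, 'Z': 0}
```

Note that a group with a single row contributes zero pairs, since there is no second row to pair it with.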
🔧 Debug
advanced
Identify the error in self join code
What error does the following Apache Spark code raise when trying to perform a self join on a DataFrame of users to find pairs with the same city?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
data = [(1, "John", "NY"), (2, "Jane", "LA"), (3, "Doe", "NY")]
columns = ["user_id", "name", "city"]
df = spark.createDataFrame(data, columns)
joined = df.join(df, df.city == df.city, "inner")
joined.show()
A. AnalysisException: Reference 'city' is ambiguous, could be: city, city.
B. ValueError: Join condition must be a boolean expression.
C. TypeError: Column is not iterable.
D. No error, outputs all pairs of users.
💡 Hint
Think about what happens when you join a DataFrame with itself without aliasing.
🚀 Application
advanced
Use a self join to find manager-subordinate pairs
You have an employee DataFrame with columns (emp_id, name, manager_id). How would you use a self join to find each employee's manager's name?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
data = [(1, "Alice", None), (2, "Bob", 1), (3, "Charlie", 1), (4, "David", 2)]
columns = ["emp_id", "name", "manager_id"]
df = spark.createDataFrame(data, columns)
df_emp = df.alias("emp")
df_mgr = df.alias("mgr")
joined = df_emp.join(df_mgr, col("emp.manager_id") == col("mgr.emp_id"), "left")
result = joined.select(col("emp.name").alias("employee"), col("mgr.name").alias("manager"))
result.show()
A. [Row(employee='Alice', manager=None), Row(employee='Bob', manager=None), Row(employee='Charlie', manager=None), Row(employee='David', manager=None)]
B. [Row(employee='Alice', manager='Bob'), Row(employee='Bob', manager='Charlie'), Row(employee='Charlie', manager='David'), Row(employee='David', manager=None)]
C. [Row(employee='Alice', manager=None), Row(employee='Bob', manager='Alice'), Row(employee='Charlie', manager='Alice'), Row(employee='David', manager='Bob')]
D. [Row(employee='Alice', manager='Alice'), Row(employee='Bob', manager='Bob'), Row(employee='Charlie', manager='Charlie'), Row(employee='David', manager='David')]
💡 Hint
Match employee's manager_id with manager's emp_id using aliases.
🧠 Conceptual
expert
Why use a self join with a condition on IDs?
Why do we often add a condition like df1.id < df2.id when performing a self join to find pairs of related rows?
A. To join only rows with identical IDs.
B. To increase the number of pairs by including reversed duplicates.
C. To filter out all pairs where IDs are equal, but allow reversed pairs.
D. To avoid duplicate pairs and self-pairing by ensuring each pair is unique and ordered.
💡 Hint
Think about how pairs like (1,2) and (2,1) relate and how to avoid counting both.
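The effect of each candidate condition can be sketched in plain Python (not Spark) with `itertools.product`, which mimics an unrestricted self join on a shared key:

```python
from itertools import product

# Tiny sketch of the hint: what each join condition leaves behind.
ids = [1, 2, 3]
all_pairs = [(a, b) for a, b in product(ids, ids)]          # 9: full self join, incl. (1, 1)
no_self = [(a, b) for a, b in product(ids, ids) if a != b]  # 6: drops self-pairs, keeps (1,2) AND (2,1)
ordered = [(a, b) for a, b in product(ids, ids) if a < b]   # 3: each unordered pair exactly once
print(len(all_pairs), len(no_self), len(ordered))  # → 9 6 3
```

Only the `a < b` condition eliminates both self-pairs and reversed duplicates in a single comparison.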