A self join helps you compare rows within the same table or dataset. It lets you find relationships between data in one table by joining it to itself.
0
0
Self joins in Apache Spark
Introduction
Finding pairs of employees who work in the same department.
Comparing sales records from the same dataset to find duplicates or related entries.
Finding customers who referred other customers in the same customer list.
Matching products that have similar attributes within the same product list.
Syntax
Apache Spark
df1 = df.alias('a') df2 = df.alias('b') joined_df = df1.join(df2, condition, 'join_type')
You create two aliases of the same DataFrame to join it with itself.
The condition defines how rows from the two aliases match.
Examples
This example finds pairs of employees in the same department but with different IDs.
Apache Spark
from pyspark.sql import SparkSession spark = SparkSession.builder.getOrCreate() data = [(1, 'Alice', 10), (2, 'Bob', 10), (3, 'Charlie', 20)] df = spark.createDataFrame(data, ['id', 'name', 'dept']) # Create aliases df1 = df.alias('a') df2 = df.alias('b') # Join on same department but different employees joined = df1.join(df2, (df1.dept == df2.dept) & (df1.id != df2.id)) joined.show()
This example joins the table to itself to find all employees in the same department, including themselves.
Apache Spark
df1 = df.alias('left') df2 = df.alias('right') # Find employees who share the same department result = df1.join(df2, df1.dept == df2.dept, 'inner') result.select('left.name', 'right.name', 'left.dept').show()
Sample Program
This program finds pairs of employees who work in the same department but are different people. It shows their IDs, names, and department.
Apache Spark
from pyspark.sql import SparkSession spark = SparkSession.builder.appName('SelfJoinExample').getOrCreate() # Sample data: employee id, name, department employees = [ (1, 'Alice', 10), (2, 'Bob', 10), (3, 'Charlie', 20), (4, 'David', 10), (5, 'Eve', 30) ] # Create DataFrame emp_df = spark.createDataFrame(employees, ['id', 'name', 'dept']) # Create two aliases of the same DataFrame emp1 = emp_df.alias('emp1') emp2 = emp_df.alias('emp2') # Join on same department but different employee ids joined_df = emp1.join(emp2, (emp1.dept == emp2.dept) & (emp1.id != emp2.id), 'inner') # Select columns to show pairs of employees in the same department result = joined_df.select(emp1.id.alias('emp1_id'), emp1.name.alias('emp1_name'), emp2.id.alias('emp2_id'), emp2.name.alias('emp2_name'), emp1.dept) result.show()
OutputSuccess
Important Notes
Always use aliases to avoid confusion when joining a DataFrame to itself.
Make sure to exclude matching the same row with itself if that is not desired, by adding a condition like id != id.
Self joins can create many rows if the matching condition is broad, so be careful with large datasets.
Summary
Self joins let you compare rows within the same dataset.
Use aliases to join a DataFrame to itself.
Filter carefully to avoid matching rows to themselves unless needed.