What is Self joins in Apache Spark?

Apache Sparkdata~5 mins

Self joins in Apache Spark

Choose your learning style9 modes available

Learn Why Deep Visual Try Challenge Project Recall Time

Introduction

A self join helps you compare rows within the same table or dataset. It lets you find relationships between data in one table by joining it to itself.

Finding pairs of employees who work in the same department.

Comparing sales records from the same dataset to find duplicates or related entries.

Finding customers who referred other customers in the same customer list.

Matching products that have similar attributes within the same product list.

Syntax

Apache Spark

df1 = df.alias('a')
df2 = df.alias('b')
joined_df = df1.join(df2, condition, 'join_type')

You create two aliases of the same DataFrame to join it with itself.

The condition defines how rows from the two aliases match.

Examples

This example finds pairs of employees in the same department but with different IDs.

Apache Spark

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, 'Alice', 10), (2, 'Bob', 10), (3, 'Charlie', 20)]
df = spark.createDataFrame(data, ['id', 'name', 'dept'])

# Create aliases
df1 = df.alias('a')
df2 = df.alias('b')

# Join on same department but different employees
joined = df1.join(df2, (df1.dept == df2.dept) & (df1.id != df2.id))
joined.show()

This example joins the table to itself to find all employees in the same department, including themselves.

Apache Spark

df1 = df.alias('left')
df2 = df.alias('right')

# Find employees who share the same department
result = df1.join(df2, df1.dept == df2.dept, 'inner')
result.select('left.name', 'right.name', 'left.dept').show()

Sample Program

This program finds pairs of employees who work in the same department but are different people. It shows their IDs, names, and department.

Apache Spark

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SelfJoinExample').getOrCreate()

# Sample data: employee id, name, department
employees = [
    (1, 'Alice', 10),
    (2, 'Bob', 10),
    (3, 'Charlie', 20),
    (4, 'David', 10),
    (5, 'Eve', 30)
]

# Create DataFrame
emp_df = spark.createDataFrame(employees, ['id', 'name', 'dept'])

# Create two aliases of the same DataFrame
emp1 = emp_df.alias('emp1')
emp2 = emp_df.alias('emp2')

# Join on same department but different employee ids
joined_df = emp1.join(emp2, (emp1.dept == emp2.dept) & (emp1.id != emp2.id), 'inner')

# Select columns to show pairs of employees in the same department
result = joined_df.select(emp1.id.alias('emp1_id'), emp1.name.alias('emp1_name'), emp2.id.alias('emp2_id'), emp2.name.alias('emp2_name'), emp1.dept)

result.show()

OutputSuccess

Important Notes

Always use aliases to avoid confusion when joining a DataFrame to itself.

Make sure to exclude matching the same row with itself if that is not desired, by adding a condition like id != id.

Self joins can create many rows if the matching condition is broad, so be careful with large datasets.

Summary

Self joins let you compare rows within the same dataset.

Use aliases to join a DataFrame to itself.

Filter carefully to avoid matching rows to themselves unless needed.