
Self joins in Apache Spark - Step-by-Step Execution

Concept Flow - Self joins
1. Start with one DataFrame.
2. Alias the DataFrame as 'a'.
3. Alias the DataFrame as 'b'.
4. Join 'a' and 'b' on a condition.
5. Result: rows matched with themselves or with related rows.
6. Output DataFrame.
Self join means joining a DataFrame with itself using aliases to compare rows within the same data.
Execution Sample (PySpark)
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, 'A', 10), (2, 'B', 10), (3, 'C', 20)],
                           ['id', 'name', 'group'])

# Alias the same DataFrame twice so the two sides of the join can be told apart
df_alias1 = df.alias('a')
df_alias2 = df.alias('b')

# Refer to columns through the aliases; comparing df.group with itself
# directly would be ambiguous in a self join
joined = (df_alias1
          .join(df_alias2, col('a.group') == col('b.group'))
          .select('a.id', 'a.name', 'b.id', 'b.name'))
joined.show()
```
This code creates a DataFrame and joins it with itself on the 'group' column to find rows sharing the same group.
Execution Table
| Step | Action | DataFrame Alias | Condition | Output Rows |
|------|--------|-----------------|-----------|-------------|
| 1 | Create DataFrame | df | N/A | [(1, 'A', 10), (2, 'B', 10), (3, 'C', 20)] |
| 2 | Create alias | a | N/A | Same as df |
| 3 | Create alias | b | N/A | Same as df |
| 4 | Join a and b on a.group == b.group | a, b | a.group == b.group | [(1, 'A', 1, 'A'), (1, 'A', 2, 'B'), (2, 'B', 1, 'A'), (2, 'B', 2, 'B'), (3, 'C', 3, 'C')] |
| 5 | Select columns a.id, a.name, b.id, b.name | a, b | N/A | Same rows as step 4 |
| 6 | Show output | N/A | N/A | Displayed rows as above |
| 7 | Exit | N/A | No more steps | Execution complete |
💡 All matching rows on group column are joined; no more steps.
Variable Tracker
| Variable | Start | After Step 2 | After Step 3 | After Step 4 | After Step 5 | Final |
|----------|-------|--------------|--------------|--------------|--------------|-------|
| df | Empty | [(1, 'A', 10), (2, 'B', 10), (3, 'C', 20)] | Same | Same | Same | Same |
| df_alias1 (a) | N/A | [(1, 'A', 10), (2, 'B', 10), (3, 'C', 20)] | Same | Same | Same | Same |
| df_alias2 (b) | N/A | N/A | [(1, 'A', 10), (2, 'B', 10), (3, 'C', 20)] | Same | Same | Same |
| joined | N/A | N/A | N/A | [(1, 'A', 1, 'A'), (1, 'A', 2, 'B'), (2, 'B', 1, 'A'), (2, 'B', 2, 'B'), (3, 'C', 3, 'C')] | Same | Same |
Key Moments - 3 Insights
Why do we need to create aliases for the same DataFrame before joining?
Because joining a DataFrame with itself requires distinguishing the two instances; the aliases 'a' and 'b' tell Spark which side of the join each column comes from, as shown in steps 2 and 3 of the execution table.
Why do some rows join with themselves in the output?
Because the join condition matches every pair of rows with the same group, including each row paired with itself. For example, the row with id 1 joins with itself at step 4 of the execution table.
What happens if we join without a condition?
It would produce a Cartesian product (every combination of rows), which is usually very large and rarely useful. The condition 'a.group == b.group' limits the join to meaningful pairs, as in step 4.
Visual Quiz - 3 Questions
Test your understanding
At step 4 of the execution table, how many rows does the join produce?
A. 3
B. 5
C. 6
D. 4
💡 Hint
Check the 'Output Rows' column at step 4 of the execution table.
According to the variable tracker, what is the value of 'joined' after step 5?
A. Joined rows with columns a.id, a.name, b.id, b.name
B. An empty DataFrame
C. The same as df
D. Only rows where a.id equals b.id
💡 Hint
Look at the 'joined' row under 'After Step 5' in the variable tracker.
If we change the join condition to 'a.id == b.id', how would the output rows change?
A. No rows will join
B. All rows will join with all others, producing more rows
C. Only rows with the same id will match, producing fewer rows
D. The output will be the same as the original
💡 Hint
Think about how the join condition filters rows at step 4 of the execution table.
Concept Snapshot
Self joins let you join a DataFrame with itself using aliases.
Use .alias() to create distinct references.
Join on a condition comparing columns from each alias.
Useful to find related rows within the same data.
Remember to select columns carefully to avoid confusion.
Full Transcript
Self joins in Apache Spark mean joining a DataFrame with itself. We start by creating a DataFrame with data. Then we create two aliases for this DataFrame, named 'a' and 'b'. We join these aliases on a condition, for example, where the 'group' column matches. This produces rows where each row is paired with others in the same group, including itself. The output shows columns from both aliases. Aliases are necessary to distinguish the two sides of the join. Without a join condition, the result would be a large Cartesian product. This technique helps find relationships within the same dataset.