
Self joins in Apache Spark - Step-by-Step Execution

Concept Flow - Self joins
1. Start with one DataFrame.
2. Alias the DataFrame as 'a'.
3. Alias the DataFrame as 'b'.
4. Join 'a' and 'b' on a condition.
5. Result: rows matched with themselves or with related rows.
6. Output DataFrame.
Self join means joining a DataFrame with itself using aliases to compare rows within the same data.
Execution Sample (PySpark)
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, 'A', 10), (2, 'B', 10), (3, 'C', 20)],
                           ['id', 'name', 'group'])

# Alias the same DataFrame twice so the two sides of the join can be told apart
df_alias1 = df.alias('a')
df_alias2 = df.alias('b')

# Refer to columns through the aliases; comparing df.group with itself
# directly would be ambiguous in a self join
joined = (df_alias1
          .join(df_alias2, col('a.group') == col('b.group'))
          .select('a.id', 'a.name', 'b.id', 'b.name'))
joined.show()
```
This code creates a DataFrame and joins it with itself on the 'group' column to find rows sharing the same group.
Execution Table
| Step | Action | DataFrame Alias | Condition | Output Rows |
|------|--------|-----------------|-----------|-------------|
| 1 | Create DataFrame | df | N/A | [(1, 'A', 10), (2, 'B', 10), (3, 'C', 20)] |
| 2 | Create alias | a | N/A | Same as df |
| 3 | Create alias | b | N/A | Same as df |
| 4 | Join a and b on a.group == b.group | a, b | a.group == b.group | [(1, 'A', 1, 'A'), (1, 'A', 2, 'B'), (2, 'B', 1, 'A'), (2, 'B', 2, 'B'), (3, 'C', 3, 'C')] |
| 5 | Select columns a.id, a.name, b.id, b.name | a, b | N/A | Same rows as step 4 |
| 6 | Show output | N/A | N/A | Displayed rows as above |
| 7 | Exit | N/A | No more steps | Execution complete |
💡 All matching rows on group column are joined; no more steps.
Variable Tracker
| Variable | Start | After Step 2 | After Step 3 | After Step 4 | After Step 5 | Final |
|----------|-------|--------------|--------------|--------------|--------------|-------|
| df | Empty | [(1, 'A', 10), (2, 'B', 10), (3, 'C', 20)] | Same | Same | Same | Same |
| df_alias1 (a) | N/A | [(1, 'A', 10), (2, 'B', 10), (3, 'C', 20)] | Same | Same | Same | Same |
| df_alias2 (b) | N/A | N/A | [(1, 'A', 10), (2, 'B', 10), (3, 'C', 20)] | Same | Same | Same |
| joined | N/A | N/A | N/A | [(1, 'A', 1, 'A'), (1, 'A', 2, 'B'), (2, 'B', 1, 'A'), (2, 'B', 2, 'B'), (3, 'C', 3, 'C')] | Same | Same |
Key Moments - 3 Insights
Why do we need to create aliases for the same DataFrame before joining?
Because joining a DataFrame with itself requires distinguishing the two instances; the aliases 'a' and 'b' tell Spark which side of the join each column comes from, as shown in steps 2 and 3 of the execution table.
Why do some rows join with themselves in the output?
Because the join condition matches every pair of rows with the same group, including each row paired with itself. For example, the row with id 1 joins with itself at step 4 of the execution table.
What happens if we join without a condition?
It would produce a Cartesian product (every combination of rows), which is usually very large and rarely useful. The condition 'a.group == b.group' limits the join to meaningful pairs, as in step 4.
Visual Quiz - 3 Questions
Test your understanding
At step 4 of the execution table, how many rows does the join produce?
A. 3
B. 5
C. 6
D. 4
💡 Hint
Check the 'Output Rows' column at step 4 of the execution table.
According to the variable tracker, what is the value of 'joined' after step 5?
A. Joined rows with columns a.id, a.name, b.id, b.name
B. An empty DataFrame
C. The same as df
D. Only rows where a.id equals b.id
💡 Hint
Look at the 'joined' row under 'After Step 5' in the variable tracker.
If we change the join condition to 'a.id == b.id', how would the output rows change?
A. No rows will join
B. All rows will join with all others, producing more rows
C. Only rows with the same id will match, producing fewer rows
D. The output will be the same as the original
💡 Hint
Think about how the join condition filters rows at step 4 of the execution table.
Concept Snapshot
Self joins let you join a DataFrame with itself using aliases.
Use .alias() to create distinct references.
Join on a condition comparing columns from each alias.
Useful to find related rows within the same data.
Remember to select columns carefully to avoid confusion.
Full Transcript
Self joins in Apache Spark mean joining a DataFrame with itself. We start by creating a DataFrame with data. Then we create two aliases for this DataFrame, named 'a' and 'b'. We join these aliases on a condition, for example, where the 'group' column matches. This produces rows where each row is paired with others in the same group, including itself. The output shows columns from both aliases. Aliases are necessary to distinguish the two sides of the join. Without a join condition, the result would be a large Cartesian product. This technique helps find relationships within the same dataset.