0
0
Apache Sparkdata~10 mins

Inner, left, right, and full outer joins in Apache Spark - Step-by-Step Execution

Choose your learning style9 modes available
Concept Flow - Inner, left, right, and full outer joins
Start with two tables
Choose join type
Inner
Match rows based on key
Return joined table with different rules
End
Start with two tables, pick a join type, match rows by key, and return a combined table based on join rules.
Execution Sample
Apache Spark
df1 = spark.createDataFrame([(1, 'A'), (2, 'B')], ['id', 'val1'])
df2 = spark.createDataFrame([(2, 'X'), (3, 'Y')], ['id', 'val2'])
inner = df1.join(df2, 'id', 'inner')
left = df1.join(df2, 'id', 'left')
right = df1.join(df2, 'id', 'right')
full = df1.join(df2, 'id', 'outer')
Create two small tables and join them using inner, left, right, and full outer joins.
Execution Table
Join TypeRows MatchedRows ReturnedExplanation
Innerid=21 rowOnly rows with matching id in both tables are returned.
Leftid=1,22 rowsAll rows from left table; matched rows from right or null.
Rightid=2,32 rowsAll rows from right table; matched rows from left or null.
Full Outerid=1,2,33 rowsAll rows from both tables; unmatched sides get null.
💡 All join types complete after processing all keys from both tables.
Variable Tracker
VariableStartAfter Inner JoinAfter Left JoinAfter Right JoinAfter Full Outer Join
innerempty[{2, B, X}][{2, B, X}][{2, B, X}][{2, B, X}]
leftemptyempty[{1, A, null}, {2, B, X}]emptyempty
rightemptyemptyempty[{2, B, X}, {3, null, Y}]empty
fullemptyemptyemptyempty[{1, A, null}, {2, B, X}, {3, null, Y}]
Key Moments - 3 Insights
Why does the inner join return fewer rows than the full outer join?
Inner join returns only rows with matching keys in both tables (see execution_table row 1), while full outer join returns all rows from both tables, filling unmatched sides with null (row 4).
Why does the left join include a row with null values?
Left join keeps all rows from the left table even if no match in right table, so unmatched right columns are null (execution_table row 2).
What happens to rows in the right table that have no match in the left during a right join?
They appear in the result with nulls for left table columns (execution_table row 3).
Visual Quiz - 3 Questions
Test your understanding
Look at the execution_table, how many rows does the left join return?
A1 row
B2 rows
C3 rows
D0 rows
💡 Hint
Check the 'Rows Returned' column for the Left join in the execution_table.
At which join type do unmatched keys from both tables appear with nulls on the opposite side?
AFull outer join
BLeft join
CInner join
DRight join
💡 Hint
Look at the Explanation for Full Outer join in the execution_table.
If the key 'id=3' was removed from df2, which join would return fewer rows?
AInner join
BLeft join
CRight join
DFull outer join
💡 Hint
Check which join depends on keys only in the right table (see variable_tracker for right join).
Concept Snapshot
Joins combine two tables by matching keys.
Inner join: only matching rows.
Left join: all left rows + matched right.
Right join: all right rows + matched left.
Full outer join: all rows from both, unmatched get nulls.
Full Transcript
We start with two tables and want to combine them based on a shared key. We pick a join type: inner, left, right, or full outer. Inner join returns only rows where keys match in both tables. Left join returns all rows from the left table, adding matching rows from the right or null if no match. Right join returns all rows from the right table, adding matching rows from the left or null if no match. Full outer join returns all rows from both tables, filling unmatched sides with nulls. We saw example data and how each join returns different rows. This helps us combine data flexibly depending on what we want to keep.