Apache Spark · data · ~10 mins

Multi-column joins in Apache Spark - Step-by-Step Execution

Concept Flow - Multi-column joins
Start with two DataFrames
Specify join columns
Perform join on multiple columns
Result: DataFrame with matched rows
Handle unmatched rows if needed (outer, left, right)
End
We start with two tables, pick columns to match on, join them by all those columns, and get combined rows where all column values match.
Execution Sample
Apache Spark
df1.join(df2, on=["id", "date"], how="inner")
Join two DataFrames on columns 'id' and 'date' keeping only rows where both match.
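The semantics of this inner join can be sketched in plain Python, using the same sample rows as the execution table below. This is an illustration of what the join computes, not how Spark executes it:

```python
# Plain-Python sketch of the inner multi-column join semantics.
# Rows mirror the sample data in the execution table.
df1 = [(1, "2023-01-01", "A"), (2, "2023-01-02", "B"), (3, "2023-01-03", "C")]
df2 = [(1, "2023-01-01", "X"), (2, "2023-01-02", "Y")]

# Keep a combined row only when BOTH join columns (id, date) match.
result = [
    (id1, date1, v1, v2)
    for (id1, date1, v1) in df1
    for (id2, date2, v2) in df2
    if id1 == id2 and date1 == date2
]
print(result)
# [(1, '2023-01-01', 'A', 'X'), (2, '2023-01-02', 'B', 'Y')]
```

Only the two (id, date) pairs present in both tables survive, matching the Spark result.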
Execution Table
Step | Action | df1 Row | df2 Row | Join Condition (id, date) | Result Row Included?
---- | ------ | ------- | ------- | ------------------------- | --------------------
1 | Check df1 row (id=1, date=2023-01-01) against df2 rows | (1, 2023-01-01, A) | (1, 2023-01-01, X) | Match | Yes
2 | Check df1 row (id=1, date=2023-01-01) against df2 rows | (1, 2023-01-01, A) | (2, 2023-01-02, Y) | No Match | No
3 | Check df1 row (id=2, date=2023-01-02) against df2 rows | (2, 2023-01-02, B) | (1, 2023-01-01, X) | No Match | No
4 | Check df1 row (id=2, date=2023-01-02) against df2 rows | (2, 2023-01-02, B) | (2, 2023-01-02, Y) | Match | Yes
5 | Check df1 row (id=3, date=2023-01-03) against df2 rows | (3, 2023-01-03, C) | (1, 2023-01-01, X) | No Match | No
6 | Check df1 row (id=3, date=2023-01-03) against df2 rows | (3, 2023-01-03, C) | (2, 2023-01-02, Y) | No Match | No
7 | End of rows | | | |
💡 Every df1 row is checked against every df2 row; only pairs where both id and date match are included in the result.
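The table above walks a nested-loop comparison for clarity. In practice, Spark's hash-based join strategies avoid comparing every pair of rows: roughly, one side is turned into a lookup table keyed by the composite (id, date) key, and the other side probes it. A minimal plain-Python sketch of that idea (assuming unique keys on the build side; a real hash join keeps a list of rows per key):

```python
# Hash-join sketch: build a lookup keyed by the composite (id, date) key,
# then probe it once per df1 row, instead of the row-by-row comparison above.
df1 = [(1, "2023-01-01", "A"), (2, "2023-01-02", "B"), (3, "2023-01-03", "C")]
df2 = [(1, "2023-01-01", "X"), (2, "2023-01-02", "Y")]

lookup = {(i, d): v for (i, d, v) in df2}   # build side: df2 keyed by (id, date)
result = [
    (i, d, v, lookup[(i, d)])               # probe side: one lookup per df1 row
    for (i, d, v) in df1
    if (i, d) in lookup
]
print(result)
```

The output is the same two matched rows, but with one dictionary probe per df1 row rather than a full cross-comparison.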
Variable Tracker
Variable | Start | After Step 1 | After Step 4 | Final
-------- | ----- | ------------ | ------------ | -----
df1 current row | None | (1, 2023-01-01, A) | (2, 2023-01-02, B) | (3, 2023-01-03, C)
df2 current row | None | (1, 2023-01-01, X) | (2, 2023-01-02, Y) | (2, 2023-01-02, Y)
Result rows count | 0 | 1 | 2 | 2
Key Moments - 2 Insights
Why do some rows from df1 not appear in the result?
Because the join is inner and requires matching values in all join columns (id and date). Rows without an exact match in df2 are excluded, as shown in execution table rows 5 and 6.
What happens if only one join column matches but not the other?
The row is not included because multi-column join requires all specified columns to match simultaneously.
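In other words, a multi-column join condition is the logical AND of the per-column equalities, so one matching column is never enough:

```python
# A multi-column join condition is the AND of per-column equalities.
row1 = (1, "2023-01-01")   # (id, date) from df1
row2 = (1, "2023-01-02")   # same id, but a different date, from df2

match = all(a == b for a, b in zip(row1, row2))
print(match)  # False: id matches but date does not, so the pair is excluded
```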
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, how many rows are included in the join result?
A. 3
B. 2
C. 4
D. 1
💡 Hint
Check the 'Result Row Included?' column in the execution table for rows marked 'Yes'.
At which step does the join condition fail because not all columns match?
A. Step 2
B. Step 4
C. Step 1
D. Step 5
💡 Hint
Look at execution table row 2, where neither id nor date matches.
If the join were changed to a 'left' join, what would happen to df1 rows without matches?
A. They would be excluded
B. They would be duplicated
C. They would appear with nulls for df2 columns
D. They would cause an error
💡 Hint
In a left join, all df1 rows appear; unmatched df2 columns show nulls.
Concept Snapshot
Multi-column joins combine two tables by matching multiple columns at once.
Syntax: df1.join(df2, on=["col1", "col2"], how="inner")
Only rows where all join columns match are included in inner join.
Other join types (left, right, outer) handle unmatched rows differently.
Useful to match complex keys like (id, date) pairs.
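The left-join behavior from the snapshot, which in Spark would be `df1.join(df2, on=["id", "date"], how="left")`, can also be sketched in plain Python: every df1 row survives, and unmatched rows get None (Spark's null) in place of df2's columns:

```python
# Plain-Python sketch of LEFT join semantics: every df1 row survives;
# rows with no df2 match get None in place of df2's columns.
df1 = [(1, "2023-01-01", "A"), (2, "2023-01-02", "B"), (3, "2023-01-03", "C")]
df2 = [(1, "2023-01-01", "X"), (2, "2023-01-02", "Y")]

lookup = {(i, d): v for (i, d, v) in df2}
left_joined = [(i, d, v, lookup.get((i, d))) for (i, d, v) in df1]
print(left_joined)
# The row (3, '2023-01-03', 'C') keeps its df1 values, with None for df2's column.
```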
Full Transcript
Multi-column joins in Apache Spark combine two DataFrames by matching rows where multiple columns have the same values. We specify the columns to join on as a list. The join keeps rows where all these columns match. For example, joining on 'id' and 'date' means only rows with the same id and date in both DataFrames appear in the result. If a row in one DataFrame has no match in the other, it is excluded in an inner join. This process is shown step-by-step in the execution table, where each row from the first DataFrame is checked against rows in the second DataFrame. The variable tracker shows how current rows and result count change during execution. Key moments clarify why some rows are excluded and how all join columns must match. The visual quiz tests understanding of these steps and join behavior. Multi-column joins are useful when a single column is not enough to uniquely match rows.