Joins combine two tables based on matching values in a shared column, letting you view related information together.
Inner, left, right, and full outer joins in Apache Spark
Introduction
You have customer info in one table and orders in another, and want to see orders with customer details.
You want to find all employees and their departments, even if some employees have no department yet.
You want to find all products and any sales they have, including products with no sales.
You want to combine two lists of events, showing all events from both lists, matched where possible.
Syntax
df1.join(df2, on='key_column', how='join_type')
df1 and df2 are the two tables (DataFrames) to join.
on is the column name, or list of column names, on which to match rows.
how is the join type: 'inner', 'left', 'right', or 'full'.
Examples
Inner join: returns rows where id matches in both tables.
df1.join(df2, on='id', how='inner')
Left join: returns all rows from df1, with matching rows from df2, or null where there is no match.
df1.join(df2, on='id', how='left')
Right join: returns all rows from df2, with matching rows from df1, or null where there is no match.
df1.join(df2, on='id', how='right')
Full outer join: returns all rows from both tables, matched where possible, with null where there is no match.
df1.join(df2, on='id', how='full')
Sample Program
This code creates two small tables: one with people and one with departments. It shows how to join them in four ways and prints the results.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('JoinExample').getOrCreate()

# Create first DataFrame
data1 = [(1, 'Alice'), (2, 'Bob'), (3, 'Charlie')]
df1 = spark.createDataFrame(data1, ['id', 'name'])

# Create second DataFrame
data2 = [(2, 'Sales'), (3, 'HR'), (4, 'IT')]
df2 = spark.createDataFrame(data2, ['id', 'department'])

# Inner join
inner_join = df1.join(df2, on='id', how='inner')
print('Inner Join Result:')
inner_join.show()

# Left join
left_join = df1.join(df2, on='id', how='left')
print('Left Join Result:')
left_join.show()

# Right join
right_join = df1.join(df2, on='id', how='right')
print('Right Join Result:')
right_join.show()

# Full outer join ('fullouter' is an accepted alias for 'full')
full_join = df1.join(df2, on='id', how='fullouter')
print('Full Outer Join Result:')
full_join.show()

spark.stop()
Important Notes
Inner join keeps only rows whose keys match in both tables.
Left join keeps all rows from the first table, filling in nulls where there is no match in the second.
Right join keeps all rows from the second table, filling in nulls where there is no match in the first.
Full outer join keeps all rows from both tables, filling in nulls wherever there is no match.
Summary
Joins combine tables based on matching columns.
Inner join shows only matching rows.
Left, right, and full joins keep unmatched rows from one or both tables.