
Inner, left, right, and full outer joins in Apache Spark

Introduction

Joins combine rows from two tables based on matching column values, letting you view related information together.

You have customer info in one table and orders in another, and want to see orders with customer details (an inner join).
You want to find all employees and their departments, even if some employees have no department yet (a left join).
You want to find all products and any sales they have, including products with no sales (a left join).
You want to combine two lists of events, showing all events from both lists, matched where possible (a full outer join).
Syntax
Apache Spark
df1.join(df2, on='key_column', how='join_type')

df1 and df2 are the two tables (DataFrames) to join.

on is the column name (or list of column names) to match rows on.

how is the join type: 'inner', 'left', 'right', or 'full'. Spark also accepts aliases, such as 'outer' or 'fullouter' for a full outer join.

Examples
Returns rows where id matches in both tables.
Apache Spark
df1.join(df2, on='id', how='inner')
Returns all rows from df1, with matching rows from df2 or null if no match.
Apache Spark
df1.join(df2, on='id', how='left')
Returns all rows from df2, with matching rows from df1 or null if no match.
Apache Spark
df1.join(df2, on='id', how='right')
Returns all rows from both tables, matching where possible, null where no match.
Apache Spark
df1.join(df2, on='id', how='full')
Sample Program

This code creates two small tables: one with people and one with departments. It shows how to join them in four ways and prints the results.

Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('JoinExample').getOrCreate()

# Create first DataFrame
data1 = [(1, 'Alice'), (2, 'Bob'), (3, 'Charlie')]
df1 = spark.createDataFrame(data1, ['id', 'name'])

# Create second DataFrame
data2 = [(2, 'Sales'), (3, 'HR'), (4, 'IT')]
df2 = spark.createDataFrame(data2, ['id', 'department'])

# Inner join
inner_join = df1.join(df2, on='id', how='inner')
print('Inner Join Result:')
inner_join.show()

# Left join
left_join = df1.join(df2, on='id', how='left')
print('Left Join Result:')
left_join.show()

# Right join
right_join = df1.join(df2, on='id', how='right')
print('Right Join Result:')
right_join.show()

# Full outer join
full_join = df1.join(df2, on='id', how='full')  # 'fullouter' and 'outer' are equivalent aliases
print('Full Outer Join Result:')
full_join.show()

spark.stop()
Important Notes

Inner join keeps only rows with matching keys in both tables.

Left join keeps all rows from the first table, adding nulls if no match in second.

Right join keeps all rows from the second table, adding nulls if no match in first.

Full outer join keeps all rows from both tables, filling nulls where no match.

Summary

Joins combine tables based on matching columns.

Inner join shows only matching rows.

Left, right, and full joins keep unmatched rows from one or both tables.