What is Multi-column joins in Apache Spark?

Apache Sparkdata~5 mins

Multi-column joins in Apache Spark

Choose your learning style9 modes available

Learn Why Deep Visual Try Challenge Project Recall Time

Introduction

Multi-column joins help combine two tables using more than one matching column. This gives more accurate matches when one column is not enough.

You want to merge customer data from two sources using both customer ID and region.

You have sales data and product info, and want to join them by product ID and store location.

You need to combine employee records from two departments using employee ID and department code.

You want to join two datasets where matching requires date and user ID together.

Syntax

Apache Spark

df1.join(df2, on=["col1", "col2"], how="inner")

The on parameter takes a list of column names to join on multiple columns.

The how parameter defines the join type: inner, left, right, full, etc.

Examples

Join two DataFrames on columns id and region using an inner join.

Apache Spark

df1.join(df2, on=["id", "region"], how="inner")

Left join on product_id and store to keep all rows from df1.

Apache Spark

df1.join(df2, on=["product_id", "store"], how="left")

Sample Program

This code creates two small tables with columns id and region. It joins them on both columns to find matching rows. The result shows only rows where both id and region match.

Apache Spark

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MultiColumnJoinExample").getOrCreate()

# Create first DataFrame
data1 = [(1, "East", "Alice"), (2, "West", "Bob"), (3, "East", "Charlie")]
columns1 = ["id", "region", "name"]
df1 = spark.createDataFrame(data1, columns1)

# Create second DataFrame
data2 = [(1, "East", 100), (2, "East", 200), (2, "West", 300), (4, "West", 400)]
columns2 = ["id", "region", "sales"]
df2 = spark.createDataFrame(data2, columns2)

# Perform multi-column inner join on 'id' and 'region'
joined_df = df1.join(df2, on=["id", "region"], how="inner")

# Show the result
joined_df.show()

OutputSuccess

Important Notes

Make sure the columns you join on exist in both DataFrames.

Column names must match exactly in spelling and case when using a list in on.

You can use different join types like left, right, or full depending on your needs.

Summary

Multi-column joins combine tables using more than one column for better matching.

Use a list of column names in the on parameter to specify multiple columns.

Choose the join type with the how parameter to control which rows appear in the result.