Multi-column joins combine two tables using more than one matching column. This gives more accurate matches when a single column alone is not enough to identify a row.
Multi-column joins in Apache Spark
df1.join(df2, on=["col1", "col2"], how="inner")
The on parameter takes a list of column names to join on multiple columns.
The how parameter defines the join type: inner, left, right, full, etc.
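Conceptually, passing a list to on means a pair of rows matches only when every listed column is equal. As a rough illustration (plain Python, not Spark), the matching rule behaves like:

```python
# Plain-Python sketch (not Spark) of the multi-column matching rule:
# two rows pair up only when ALL of the listed key columns are equal.
def keys_match(row1, row2, on):
    return all(row1[col] == row2[col] for col in on)

left = {"col1": 1, "col2": "A", "x": 10}
right_match = {"col1": 1, "col2": "A", "y": 20}
right_miss = {"col1": 1, "col2": "B", "y": 30}

print(keys_match(left, right_match, ["col1", "col2"]))  # True
print(keys_match(left, right_miss, ["col1", "col2"]))   # False: col2 differs
```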
Join df1 and df2 on id and region using an inner join:
df1.join(df2, on=["id", "region"], how="inner")
Join on product_id and store, keeping all rows from df1 with a left join:
df1.join(df2, on=["product_id", "store"], how="left")
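The key property of a left join is that every row from the first table survives, with missing values where no match exists on the right. A plain-Python sketch (not Spark, with made-up sample rows) of that behavior on two key columns:

```python
# Plain-Python sketch (not Spark) of left-join semantics on two key columns:
# every left row is kept; unmatched rows get None for the right-hand value.
left_rows = [("p1", "s1", 5), ("p2", "s1", 7)]   # (product_id, store, qty)
right_rows = [("p1", "s1", 0.99)]                # (product_id, store, price)

price_index = {(pid, store): price for (pid, store, price) in right_rows}
joined = [(pid, store, qty, price_index.get((pid, store)))
          for (pid, store, qty) in left_rows]
print(joined)  # [('p1', 's1', 5, 0.99), ('p2', 's1', 7, None)]
```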
This code creates two small tables with columns id and region. It joins them on both columns to find matching rows. The result shows only rows where both id and region match.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MultiColumnJoinExample").getOrCreate()

# Create first DataFrame
data1 = [(1, "East", "Alice"), (2, "West", "Bob"), (3, "East", "Charlie")]
columns1 = ["id", "region", "name"]
df1 = spark.createDataFrame(data1, columns1)

# Create second DataFrame
data2 = [(1, "East", 100), (2, "East", 200), (2, "West", 300), (4, "West", 400)]
columns2 = ["id", "region", "sales"]
df2 = spark.createDataFrame(data2, columns2)

# Perform multi-column inner join on 'id' and 'region'
joined_df = df1.join(df2, on=["id", "region"], how="inner")

# Show the result
joined_df.show()
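For the sample data above, only the pairs (id=1, region="East") and (id=2, region="West") appear in both DataFrames, so the inner join keeps two rows. A quick plain-Python cross-check (not Spark) of which pairs match:

```python
# Plain-Python cross-check (not Spark) of which (id, region) pairs match
data1 = [(1, "East", "Alice"), (2, "West", "Bob"), (3, "East", "Charlie")]
data2 = [(1, "East", 100), (2, "East", 200), (2, "West", 300), (4, "West", 400)]

matched = [(i1, r1, name, sales)
           for (i1, r1, name) in data1
           for (i2, r2, sales) in data2
           if (i1, r1) == (i2, r2)]
print(matched)  # [(1, 'East', 'Alice', 100), (2, 'West', 'Bob', 300)]
```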
Make sure the columns you join on exist in both DataFrames.
Column names must match exactly, in spelling and case, in both DataFrames when you pass a list to on.
You can use different join types like left, right, or full depending on your needs.
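To see how the join type changes the result, here is a plain-Python comparison (not Spark) on key pairs like those in the example above: an inner join drops unmatched rows, while a left join keeps every row of the first table.

```python
# Plain-Python comparison (not Spark) of inner vs. left join row counts
data1 = [(1, "East"), (2, "West"), (3, "East")]       # (id, region) keys in df1
data2_keys = {(1, "East"), (2, "West"), (4, "West")}  # (id, region) keys in df2

inner = [row for row in data1 if row in data2_keys]    # matched rows only
left = [(row, row in data2_keys) for row in data1]     # all df1 rows, match flag

print(len(inner))  # 2 rows survive the inner join
print(len(left))   # 3 rows survive the left join
```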
Multi-column joins combine tables using more than one column for better matching.
Use a list of column names in the on parameter to specify multiple columns.
Choose the join type with the how parameter to control which rows appear in the result.