Apache Spark · How-To · Beginner · 3 min read

How to Use Join in PySpark: Syntax and Examples

In PySpark, you use join() to combine two DataFrames based on a common column or condition. The method syntax is df1.join(df2, on, how), where on specifies the join key(s) and how sets the join type, such as inner, left, right, or outer.

Syntax

The join() method in PySpark combines two DataFrames based on matching columns or expressions.

  • df1.join(df2, on, how): Joins df1 with df2.
  • on: Column name(s) or join expression to match rows.
  • how: Type of join - inner (default), left, right, outer, left_semi, or left_anti.
python
df1.join(df2, on='column_name', how='inner')

Example

This example shows how to join two DataFrames on a common column id using an inner join.

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('JoinExample').getOrCreate()

# Create first DataFrame
data1 = [(1, 'Alice'), (2, 'Bob'), (3, 'Charlie')]
df1 = spark.createDataFrame(data1, ['id', 'name'])

# Create second DataFrame
data2 = [(1, 'Sales'), (2, 'Marketing'), (4, 'Finance')]
df2 = spark.createDataFrame(data2, ['id', 'department'])

# Perform inner join on 'id'
joined_df = df1.join(df2, on='id', how='inner')

joined_df.show()
Output
+---+-----+----------+
| id| name|department|
+---+-----+----------+
|  1|Alice|     Sales|
|  2|  Bob| Marketing|
+---+-----+----------+

Common Pitfalls

Common mistakes when using join() in PySpark include:

  • Omitting or mistyping the on condition, which can produce an unexpected Cartesian product.
  • Using ambiguous column names without aliasing, leading to analysis errors at query time.
  • Forgetting to specify how, so the join silently defaults to inner when another join type was intended.

Always check column names and join conditions carefully.

python
from pyspark.sql.functions import col

# Wrong: ambiguous column names without alias
# joined_df = df1.join(df2, 'id')  # works, but any non-key columns with the same name stay ambiguous

# Right: use aliases to avoid ambiguity

df1_alias = df1.alias('df1')
df2_alias = df2.alias('df2')
joined_df = df1_alias.join(df2_alias, col('df1.id') == col('df2.id'), how='inner')

joined_df.select('df1.id', 'df1.name', 'df2.department').show()
Output
+---+-----+----------+
| id| name|department|
+---+-----+----------+
|  1|Alice|     Sales|
|  2|  Bob| Marketing|
+---+-----+----------+

Quick Reference

Parameter | Description                        | Example Values
df2       | Second DataFrame to join           | df2
on        | Column(s) or expression to join on | 'id', ['id', 'date'], col('df1.id') == col('df2.id')
how       | Type of join                       | 'inner', 'left', 'right', 'outer', 'left_semi', 'left_anti'

Key Takeaways

  • Use df1.join(df2, on, how) to combine DataFrames on matching columns or conditions.
  • Specify the join type with 'how' to control which rows appear in the result.
  • Avoid ambiguous column names by using aliases when joining DataFrames with overlapping columns.
  • The default join type is 'inner' if 'how' is not specified.
  • Always verify join keys to prevent unintended large outputs or Cartesian products.