How to Use Join in PySpark: Syntax and Examples
In PySpark, you use
join() to combine two DataFrames based on a common column or condition. The method syntax is df1.join(df2, on, how), where on specifies the join keys and how defines the join type like inner, left, right, or outer.Syntax
The join() method in PySpark combines two DataFrames based on matching columns or expressions.
- df1.join(df2, on, how): Joins
df1withdf2. on: Column name(s) or join expression to match rows.how: Type of join -inner(default),left,right,outer,left_semi, orleft_anti.
python
df1.join(df2, on='column_name', how='inner')
Example
This example shows how to join two DataFrames on a common column id using an inner join.
python
from pyspark.sql import SparkSession spark = SparkSession.builder.appName('JoinExample').getOrCreate() # Create first DataFrame data1 = [(1, 'Alice'), (2, 'Bob'), (3, 'Charlie')] df1 = spark.createDataFrame(data1, ['id', 'name']) # Create second DataFrame data2 = [(1, 'Sales'), (2, 'Marketing'), (4, 'Finance')] df2 = spark.createDataFrame(data2, ['id', 'department']) # Perform inner join on 'id' joined_df = df1.join(df2, on='id', how='inner') joined_df.show()
Output
+---+-----+----------+
| id| name|department|
+---+-----+----------+
| 1|Alice| Sales|
| 2| Bob| Marketing|
+---+-----+----------+
Common Pitfalls
Common mistakes when using join() in PySpark include:
- Not specifying the
onparameter correctly, causing unexpected Cartesian products. - Using ambiguous column names without aliasing, leading to errors.
- Forgetting to specify
howand getting default inner join when another join type was intended.
Always check column names and join conditions carefully.
python
from pyspark.sql.functions import col # Wrong: ambiguous column names without alias # joined_df = df1.join(df2, 'id') # works but if columns overlap, can cause confusion # Right: use aliases to avoid ambiguity df1_alias = df1.alias('df1') df2_alias = df2.alias('df2') joined_df = df1_alias.join(df2_alias, col('df1.id') == col('df2.id'), how='inner') joined_df.select('df1.id', 'df1.name', 'df2.department').show()
Output
+---+-----+----------+
| id| name|department|
+---+-----+----------+
| 1|Alice| Sales|
| 2| Bob| Marketing|
+---+-----+----------+
Quick Reference
| Parameter | Description | Example Values |
|---|---|---|
| df2 | Second DataFrame to join | df2 |
| on | Column(s) or expression to join on | 'id', ['id', 'date'], col('df1.id') == col('df2.id') |
| how | Type of join | 'inner', 'left', 'right', 'outer', 'left_semi', 'left_anti' |
Key Takeaways
Use df1.join(df2, on, how) to combine DataFrames on matching columns or conditions.
Specify the join type with 'how' to control which rows appear in the result.
Avoid ambiguous column names by using aliases when joining DataFrames with overlapping columns.
The default join type is 'inner' if 'how' is not specified.
Always verify join keys to prevent unintended large outputs or Cartesian products.