
How to Use Union in PySpark: Syntax and Examples

In PySpark, you can combine two DataFrames with the union() method, which stacks the rows of both. Columns are matched by position, not by name, so both DataFrames need the same number of columns, in the same order, with compatible types.

Syntax

The union() method returns a new DataFrame that contains the rows of both inputs. It behaves like SQL UNION ALL: columns are matched by position and duplicate rows are kept.

  • df1.union(df2): Returns a new DataFrame with the rows from df1 and the rows from df2.
  • df1 and df2 must have the same number of columns. Column names are taken from df1; the names in df2 are ignored.
  • Duplicates are not removed; use distinct() after union if needed.
python
unioned_df = df1.union(df2)

Example

This example creates two small DataFrames with the same columns and stacks their rows into one DataFrame with union().

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('UnionExample').getOrCreate()

# Create first DataFrame
data1 = [(1, 'Alice'), (2, 'Bob')]
columns = ['id', 'name']
df1 = spark.createDataFrame(data1, columns)

# Create second DataFrame
data2 = [(3, 'Charlie'), (4, 'David')]
df2 = spark.createDataFrame(data2, columns)

# Use union to combine
df_union = df1.union(df2)

df_union.show()
Output
+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
|  4|  David|
+---+-------+

Common Pitfalls

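Common mistakes when using union() include:

  • Trying to union DataFrames with a different number of columns, which raises an AnalysisException.
  • Assuming columns are matched by name. union() matches by position, so mismatched names or types can be combined without complaint and values end up in the wrong column.
  • Expecting union() to remove duplicates; it does not. Use distinct() after union if unique rows are needed.

The example below shows a union that runs but produces misleading data: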

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('UnionPitfalls').getOrCreate()

# DataFrames with different schemas

data1 = [(1, 'Alice')]
columns1 = ['id', 'name']
df1 = spark.createDataFrame(data1, columns1)

data2 = [(2, 30)]
columns2 = ['id', 'age']
df2 = spark.createDataFrame(data2, columns2)

# Wrong: union() matches columns by position and ignores names, so the
# name mismatch alone is not reported; 'age' values would land under 'name'.
# df1.union(df2)

# Renaming makes the schemas look alike but does not fix the meaning
df2_renamed = df2.withColumnRenamed('age', 'name').select('id', 'name')

# The union runs, but an age (30) now sits in the 'name' column
union_df = df1.union(df2_renamed)
union_df.show()
Output
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|   30|
+---+-----+

Quick Reference

Tips for using union() in PySpark:

  • Ensure both DataFrames have the same number of columns, in the same order, with compatible types before union.
  • Use distinct() after union to remove duplicate rows.
  • Use unionByName() if columns are the same but in different order.
  • Union does not change the original DataFrames; it returns a new one.

Key Takeaways

Use union() to stack rows from two DataFrames with the same schema.
Both DataFrames must have the same number of columns; union() matches them by position, not by name.
Duplicates are not removed by union(); use distinct() if needed.
Use unionByName() to union DataFrames with columns in different orders.
Always check schemas before union to avoid errors and misaligned columns.