How to Use Union in PySpark: Syntax and Examples
In PySpark, you can combine two DataFrames with the union() method, which stacks rows from both DataFrames. Both DataFrames must have the same schema (same columns and types) for union() to work correctly.
Syntax
The union() method takes the DataFrame to append as its only argument. Columns are matched by position, not by name, so both DataFrames should have the same columns, in the same order, with the same data types.
- df1.union(df2): Returns a new DataFrame with rows from df1 followed by rows from df2.
- The schemas of df1 and df2 must match exactly.
- Duplicates are not removed; use distinct() after union if needed.
python
unioned_df = df1.union(df2)
Example
This example shows how to create two simple DataFrames and combine them using union(). It demonstrates stacking rows from both DataFrames into one.
python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('UnionExample').getOrCreate()

# Create first DataFrame
data1 = [(1, 'Alice'), (2, 'Bob')]
columns = ['id', 'name']
df1 = spark.createDataFrame(data1, columns)

# Create second DataFrame
data2 = [(3, 'Charlie'), (4, 'David')]
df2 = spark.createDataFrame(data2, columns)

# Use union to combine
df_union = df1.union(df2)
df_union.show()
Output
+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
|  4|  David|
+---+-------+
Common Pitfalls
Common mistakes when using union() include:
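- Trying to union DataFrames with a different number of columns, or with column types Spark cannot reconcile, which raises an AnalysisException.
- Assuming union() matches columns by name; it matches by position, so differently named or ordered columns can be combined silently.
- Expecting union() to remove duplicates; it does not.
- Not using distinct() after union if unique rows are needed.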
Example of wrong and right usage:
python
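from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('UnionPitfalls').getOrCreate()

# DataFrames with different schemas
data1 = [(1, 'Alice')]
columns1 = ['id', 'name']
df1 = spark.createDataFrame(data1, columns1)

data2 = [(2, 30)]
columns2 = ['id', 'age']
df2 = spark.createDataFrame(data2, columns2)

# Wrong: union with different schemas.
# union() matches columns by position and ignores their names, so this does
# not reliably fail. An AnalysisException is raised when the column counts
# differ or the types cannot be reconciled; otherwise 'age' values land
# under 'name' (type coercion depends on the Spark version and ANSI settings).
# df1.union(df2)

# Right: make the column names and order explicit before the union.
# The union now runs, but the data meaning may still be wrong
# (the age ends up in the 'name' column).
df2_renamed = df2.withColumnRenamed('age', 'name').select('id', 'name')
union_df = df1.union(df2_renamed)
union_df.show()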
Output
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|   30|
+---+-----+
Quick Reference
Tips for using union() in PySpark:
- Ensure both DataFrames have the same schema before union.
- Use distinct() after union to remove duplicate rows.
- Use unionByName() if columns are the same but in different order.
- Union does not change the original DataFrames; it returns a new one.
Key Takeaways
- Use union() to stack rows from two DataFrames with the same schema.
- Both DataFrames must have identical columns and data types for union() to work.
- Duplicates are not removed by union(); use distinct() if needed.
- Use unionByName() to union DataFrames with columns in different orders.
- Always check schemas before union to avoid errors.