Apache Spark · How-To · Beginner · 3 min read

How to Use explode in PySpark: Syntax and Examples

In PySpark, use the explode function to transform an array or map column into multiple rows, one for each element. It is commonly used with select or withColumn to flatten nested data structures.

Syntax

The explode function takes a column containing arrays or maps and returns a new row for each element in the array or each key-value pair in the map.

Usage:

  • explode(column): Explodes an array or map column into multiple rows.
  • Used inside select or withColumn to flatten nested data.
python
from pyspark.sql.functions import explode

df.select(explode(df.array_column))

Example

This example shows how to explode an array column to create one row per array element.

python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName('explode-example').getOrCreate()

data = [
    (1, ['apple', 'banana', 'cherry']),
    (2, ['orange', 'grape'])
]

columns = ['id', 'fruits']
df = spark.createDataFrame(data, columns)

# Explode the 'fruits' array column
exploded_df = df.select('id', explode('fruits').alias('fruit'))

exploded_df.show()
Output
+---+------+
| id| fruit|
+---+------+
|  1| apple|
|  1|banana|
|  1|cherry|
|  2|orange|
|  2| grape|
+---+------+

Common Pitfalls

Common mistakes when using explode include:

  • Trying to explode a column that is not an array or map, which causes errors.
  • Not aliasing the exploded column, leading to unclear column names.
  • Forgetting that explode increases the number of rows, which can affect joins or aggregations.
python
from pyspark.sql.functions import explode

# Wrong: exploding a non-array/non-map column raises an error
# df.select(explode('id')).show()  # 'id' is an integer column, so this fails

# Right: explode only array/map columns
exploded_df = df.select('id', explode('fruits').alias('fruit'))
exploded_df.show()
Output
+---+------+
| id| fruit|
+---+------+
|  1| apple|
|  1|banana|
|  1|cherry|
|  2|orange|
|  2| grape|
+---+------+

Quick Reference

  • explode(column): Converts an array/map column into multiple rows. Example: df.select(explode('array_col'))
  • alias(name): Renames the exploded column for clarity. Example: explode('array_col').alias('item')
  • withColumn(name, explode(col)): Adds exploded data as a new column. Example: df.withColumn('item', explode('array_col'))

Key Takeaways

  • Use explode to flatten array or map columns into multiple rows in PySpark.
  • Always alias the exploded column for clear output.
  • Exploding increases row count, so adjust downstream logic accordingly.
  • Only use explode on columns of type array or map to avoid errors.
  • Combine explode with select or withColumn to transform nested data.