How to Use explode in PySpark: Syntax and Examples
In PySpark, use the explode function to transform an array or map column into multiple rows, one for each element. It is commonly used with select or withColumn to flatten nested data structures.

Syntax
The explode function takes a column containing arrays or maps and returns a new row for each element in the array or each key-value pair in the map.
Usage:
- explode(column): Explodes an array or map column into multiple rows.
- Used inside select or withColumn to flatten nested data.
```python
from pyspark.sql.functions import explode

df.select(explode(df.array_column))
```
Example
This example shows how to explode an array column to create one row per array element.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName('explode-example').getOrCreate()

data = [
    (1, ['apple', 'banana', 'cherry']),
    (2, ['orange', 'grape'])
]
columns = ['id', 'fruits']
df = spark.createDataFrame(data, columns)

# Explode the 'fruits' array column
exploded_df = df.select('id', explode('fruits').alias('fruit'))
exploded_df.show()
```
Output
+---+------+
| id| fruit|
+---+------+
| 1| apple|
| 1|banana|
| 1|cherry|
| 2|orange|
| 2| grape|
+---+------+
Common Pitfalls
Common mistakes when using explode include:
- Trying to explode a column that is not an array or map, which causes errors.
- Not aliasing the exploded column, leading to unclear column names.
- Forgetting that explode increases the number of rows, which can affect joins or aggregations.
```python
from pyspark.sql.functions import explode

# Wrong: exploding a string column (not array/map) causes an error
# df.select(explode('id')).show()  # This will fail

# Right: explode only array or map columns
exploded_df = df.select('id', explode('fruits').alias('fruit'))
exploded_df.show()
```
Output
+---+------+
| id| fruit|
+---+------+
| 1| apple|
| 1|banana|
| 1|cherry|
| 2|orange|
| 2| grape|
+---+------+
Quick Reference
| Function | Description | Example |
|---|---|---|
| explode(column) | Converts array/map column into multiple rows | df.select(explode('array_col')) |
| alias(name) | Renames the exploded column for clarity | explode('array_col').alias('item') |
| withColumn(name, explode(col)) | Adds exploded data as new column | df.withColumn('item', explode('array_col')) |
Key Takeaways
- Use explode to flatten array or map columns into multiple rows in PySpark.
- Always alias the exploded column for clear output.
- Exploding increases row count, so adjust downstream joins and aggregations accordingly.
- Only use explode on columns of type array or map to avoid errors.
- Combine explode with select or withColumn to transform nested data.