Apache Spark · How-To · Beginner · 3 min read

How to Use Unpivot in PySpark: Syntax and Example

In PySpark, you can perform an unpivot operation by using the selectExpr method with the stack function. stack converts columns into rows: you pass the number of columns to unpivot, followed by each column's name and its value expression in pairs.

Syntax

The unpivot operation in PySpark is done using selectExpr with the stack function inside it.

  • stack(n, 'col1', col1, 'col2', col2, ...): n is the number of columns to unpivot.
  • Each pair inside stack is the column name as a string and the column itself.
  • The result is two columns: one for the column names and one for the values.
```python
df.selectExpr("id", "stack(3, 'A', A, 'B', B, 'C', C) as (variable, value)")
```
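Conceptually, stack expands each input row into one output row per column pair. A plain-Python sketch of that transformation (the helper stack_rows is our own illustration, not part of the Spark API):

```python
def stack_rows(rows, id_col, value_cols):
    """Expand each input dict into one (id, variable, value) tuple per column.

    Mimics what stack(n, ...) does inside selectExpr; illustration only.
    """
    for row in rows:
        for col in value_cols:
            yield (row[id_col], col, row[col])

rows = [{"id": 1, "A": 10, "B": 20, "C": 30}]
print(list(stack_rows(rows, "id", ["A", "B", "C"])))
# [(1, 'A', 10), (1, 'B', 20), (1, 'C', 30)]
```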

Example

This example shows how to unpivot three columns A, B, and C into two columns: variable and value.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("UnpivotExample").getOrCreate()

data = [
    (1, 10, 20, 30),
    (2, 40, 50, 60)
]

columns = ["id", "A", "B", "C"]

df = spark.createDataFrame(data, columns)

# Unpivot columns A, B, C
unpivoted_df = df.selectExpr(
    "id",
    "stack(3, 'A', A, 'B', B, 'C', C) as (variable, value)"
)

unpivoted_df.show()
```
Output:

```
+---+--------+-----+
| id|variable|value|
+---+--------+-----+
|  1|       A|   10|
|  1|       B|   20|
|  1|       C|   30|
|  2|       A|   40|
|  2|       B|   50|
|  2|       C|   60|
+---+--------+-----+
```

Common Pitfalls

Common mistakes when unpivoting in PySpark include:

  • Not matching the number n in stack(n, ...) with the actual number of columns to unpivot.
  • Forgetting to alias the output columns after stack using as (col1, col2).
  • Passing column names as strings without their corresponding column values.

Example of a wrong and a right way:

```python
# Wrong way: missing alias, and n=2 while three column pairs are given
# df.selectExpr("id", "stack(2, 'A', A, 'B', B, 'C', C)")

# Right way:
df.selectExpr("id", "stack(3, 'A', A, 'B', B, 'C', C) as (variable, value)")
```

Quick Reference

| Concept | Description | Example |
| --- | --- | --- |
| `stack(n, ...)` | Unpivot n columns into two columns (name and value) | `stack(3, 'A', A, 'B', B, 'C', C)` |
| `selectExpr` | Run SQL expressions on a DataFrame | `df.selectExpr("id", "stack(...) as (var, val)")` |
| Alias | Name the output columns after `stack` | `as (variable, value)` |
| Column pairs | Each column name string followed by its column | `'A', A, 'B', B` |
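When the column list is long or not known in advance, the stack expression can be built programmatically. A small sketch (the helper build_stack_expr is our own, not part of PySpark):

```python
def build_stack_expr(cols, var_name="variable", value_name="value"):
    """Build a stack() SQL expression for selectExpr from a list of column names."""
    pairs = ", ".join(f"'{c}', {c}" for c in cols)
    return f"stack({len(cols)}, {pairs}) as ({var_name}, {value_name})"

expr = build_stack_expr(["A", "B", "C"])
print(expr)
# stack(3, 'A', A, 'B', B, 'C', C) as (variable, value)
# Usage: df.selectExpr("id", expr)
```

Deriving n from len(cols) also rules out the mismatched-count pitfall above.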

Key Takeaways

  • Use the stack function inside selectExpr to unpivot columns in PySpark.
  • Ensure the number n in stack matches the number of columns you want to unpivot.
  • Always alias the output columns after stack with as (name, value).
  • Pass column names as strings followed by their column references in pairs.
  • Common errors include a mismatched n and missing aliasing after stack.