How to Use Unpivot in PySpark: Syntax and Example
In PySpark, you can perform an unpivot operation using the `selectExpr` method combined with the `stack` function. This converts columns into rows by specifying, inside `stack`, the number of columns to unpivot along with their names and values.

Syntax
The unpivot operation in PySpark is done using selectExpr with the stack function inside it.
- `stack(n, 'col1', col1, 'col2', col2, ...)`: `n` is the number of columns to unpivot.
- Each pair inside `stack` is the column name as a string followed by the column itself.
- The result is two columns: one for the column names and one for the values.
```python
df.selectExpr("id", "stack(3, 'A', A, 'B', B, 'C', C) as (variable, value)")
```
Example
This example shows how to unpivot three columns A, B, and C into two columns: variable and value.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("UnpivotExample").getOrCreate()

data = [
    (1, 10, 20, 30),
    (2, 40, 50, 60)
]
columns = ["id", "A", "B", "C"]
df = spark.createDataFrame(data, columns)

# Unpivot columns A, B, C
unpivoted_df = df.selectExpr(
    "id",
    "stack(3, 'A', A, 'B', B, 'C', C) as (variable, value)"
)
unpivoted_df.show()
```
Output

```
+---+--------+-----+
| id|variable|value|
+---+--------+-----+
|  1|       A|   10|
|  1|       B|   20|
|  1|       C|   30|
|  2|       A|   40|
|  2|       B|   50|
|  2|       C|   60|
+---+--------+-----+
```
Common Pitfalls
Common mistakes when unpivoting in PySpark include:
- Not matching the number `n` in `stack(n, ...)` with the actual number of columns to unpivot.
- Forgetting to alias the output columns after `stack` using `as (col1, col2)`.
- Passing column names as strings without their corresponding column values.
Example of a wrong and a right way:
```python
# Wrong way: missing alias and wrong number of columns
# df.selectExpr("id", "stack(2, 'A', A, 'B', B, 'C', C)")  # n=2 but 3 columns given

# Right way:
df.selectExpr("id", "stack(3, 'A', A, 'B', B, 'C', C) as (variable, value)")
```
Quick Reference
| Concept | Description | Example |
|---|---|---|
| `stack(n, ...)` | Unpivots `n` columns into two columns (name and value) | `stack(3, 'A', A, 'B', B, 'C', C)` |
| `selectExpr` | Runs SQL expressions on a DataFrame | `df.selectExpr("id", "stack(...) as (var, val)")` |
| Alias | Names the output columns after `stack` | `as (variable, value)` |
| Column pairs | Each column name string followed by its column | `'A', A, 'B', B` |
Key Takeaways
- Use the `stack` function inside `selectExpr` to unpivot columns in PySpark.
- Ensure the number `n` in `stack` matches the number of columns you want to unpivot.
- Always alias the output columns after `stack` with `as (name, value)`.
- Pass column names as strings followed by their column references, in pairs.
- Common errors include a mismatched `n` and missing aliasing after `stack`.