How to Use when and otherwise in PySpark for Conditional Logic
In PySpark, use when to specify a condition and its result, and otherwise to define the fallback value if the condition is false. Together, these functions create new columns with conditional values in a DataFrame.
Syntax
The when function takes a condition and a value to return if the condition is true. The otherwise function defines the value to return if the condition is false. Together, they form a conditional expression for DataFrame columns.
when(condition, value): Returns value if condition is true.
otherwise(value): Returns value if the when condition is false.
```python
from pyspark.sql.functions import when

# Basic syntax example
new_column = when(condition, value).otherwise(other_value)
```
Example
This example creates a new column, status, based on the score column. If the score is 50 or more, the status is 'pass'; otherwise, it is 'fail'.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import when

spark = SparkSession.builder.appName('Example').getOrCreate()

data = [(1, 45), (2, 75), (3, 30), (4, 90)]
columns = ['id', 'score']
df = spark.createDataFrame(data, columns)

# Add 'status' column based on condition
result_df = df.withColumn('status', when(df.score >= 50, 'pass').otherwise('fail'))
result_df.show()
```
Output
+---+-----+------+
| id|score|status|
+---+-----+------+
| 1| 45| fail|
| 2| 75| pass|
| 3| 30| fail|
| 4| 90| pass|
+---+-----+------+
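Conceptually, when/otherwise behaves like a per-row if/else. The plain-Python sketch below is an analogy, not the PySpark API: it mirrors what the expression above computes for each row, including the null (modeled here as None) that appears if otherwise is omitted.

```python
def when_otherwise(score, fallback=None):
    # Mirrors when(df.score >= 50, 'pass').otherwise(fallback) for one row.
    # Without otherwise(), PySpark fills non-matching rows with null (None here).
    return 'pass' if score >= 50 else fallback

scores = [45, 75, 30, 90]
print([when_otherwise(s, 'fail') for s in scores])  # ['fail', 'pass', 'fail', 'pass']
print([when_otherwise(s) for s in scores])          # [None, 'pass', None, 'pass']
```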
Common Pitfalls
One common mistake is forgetting to use otherwise, which results in null values when the when condition is false. Another is using multiple when statements without chaining them properly, which can cause unexpected results.
Always chain multiple conditions with when and end with otherwise for a default value.
```python
from pyspark.sql.functions import when

# Wrong: missing otherwise() leaves nulls where score < 50
wrong_df = df.withColumn('status', when(df.score >= 50, 'pass'))

# Right: chain when() calls and end with otherwise() to handle all cases
right_df = df.withColumn('status',
    when(df.score >= 80, 'excellent')
    .when(df.score >= 50, 'pass')
    .otherwise('fail'))
right_df.show()
```
Output
+---+-----+---------+
| id|score| status|
+---+-----+---------+
| 1| 45| fail|
| 2| 75| pass|
| 3| 30| fail|
| 4| 90|excellent|
+---+-----+---------+
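Chained when() clauses are evaluated top to bottom, and the first true condition wins, so order matters: putting score >= 50 before score >= 80 would mean no row could ever reach 'excellent'. This evaluation order can be sketched in plain Python (a conceptual analogy, not the PySpark API):

```python
def classify(score):
    # Mirrors when(score >= 80, 'excellent').when(score >= 50, 'pass').otherwise('fail'):
    # conditions are checked top to bottom; the first true branch wins.
    if score >= 80:
        return 'excellent'
    if score >= 50:
        return 'pass'
    return 'fail'

print([classify(s) for s in [45, 75, 30, 90]])  # ['fail', 'pass', 'fail', 'excellent']
```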
Quick Reference
| Function | Purpose | Example |
|---|---|---|
| when(condition, value) | Returns value if condition is true | when(df.age > 18, 'adult') |
| otherwise(value) | Returns value if condition is false | when(df.age > 18, 'adult').otherwise('minor') |
Key Takeaways
Use when() to specify a condition and its result in a DataFrame column.
Always use otherwise() to define the fallback value for false conditions.
Chain multiple when() calls for multiple conditions before otherwise().
Missing otherwise() leads to null values for unmatched rows.
when and otherwise help create clear, readable conditional logic in PySpark.