Apache Spark · How-To · Beginner · 3 min read

How to Use when and otherwise in PySpark for Conditional Logic

In PySpark, use when to specify a condition and its result, and otherwise to define the fallback value if the condition is false. These functions help create new columns with conditional values in a DataFrame.

Syntax

The when function takes a condition and a value to return if the condition is true. The otherwise function defines the value to return if the condition is false. Together, they form a conditional expression for DataFrame columns.

  • when(condition, value): Returns value if condition is true.
  • otherwise(value): Returns value if the when condition is false.
```python
from pyspark.sql.functions import when

# Basic syntax: value when condition holds, other_value otherwise
new_column = when(condition, value).otherwise(other_value)
```

Example

This example shows how to create a new column status in a DataFrame based on the score column. If the score is 50 or more, the status is 'pass'; otherwise, it is 'fail'.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import when

spark = SparkSession.builder.appName('Example').getOrCreate()
data = [(1, 45), (2, 75), (3, 30), (4, 90)]
columns = ['id', 'score']
df = spark.createDataFrame(data, columns)

# Add 'status' column based on condition
result_df = df.withColumn('status', when(df.score >= 50, 'pass').otherwise('fail'))

result_df.show()
```
Output

```
+---+-----+------+
| id|score|status|
+---+-----+------+
|  1|   45|  fail|
|  2|   75|  pass|
|  3|   30|  fail|
|  4|   90|  pass|
+---+-----+------+
```

Common Pitfalls

One common mistake is omitting otherwise, which leaves null in every row where the when condition is false. Another is writing separate when expressions instead of chaining them: repeated withColumn calls on the same column name overwrite each other, so only the last condition takes effect.

Always chain multiple conditions with when and end with otherwise for a default value.

```python
from pyspark.sql.functions import when

# Wrong: missing otherwise leads to nulls
wrong_df = df.withColumn('status', when(df.score >= 50, 'pass'))

# Right: chain conditions and use otherwise to handle all cases
right_df = df.withColumn('status', when(df.score >= 80, 'excellent')
                                   .when(df.score >= 50, 'pass')
                                   .otherwise('fail'))

right_df.show()
```
Output

```
+---+-----+---------+
| id|score|   status|
+---+-----+---------+
|  1|   45|     fail|
|  2|   75|     pass|
|  3|   30|     fail|
|  4|   90|excellent|
+---+-----+---------+
```

Quick Reference

| Function | Purpose | Example |
| --- | --- | --- |
| `when(condition, value)` | Returns value if condition is true | `when(df.age > 18, 'adult')` |
| `otherwise(value)` | Returns value if the when condition is false | `when(df.age > 18, 'adult').otherwise('minor')` |

Key Takeaways

  • Use when() to specify a condition and its result in a DataFrame column.
  • Always use otherwise() to define the fallback value for false conditions.
  • Chain multiple when() calls for multiple conditions before otherwise().
  • Missing otherwise() leads to null values for unmatched rows.
  • when and otherwise help create clear, readable conditional logic in PySpark.