
How to Create UDF in PySpark: Simple Guide with Examples

In PySpark, you create a User Defined Function (UDF) using pyspark.sql.functions.udf by defining a Python function and registering it with a return type. Then, you can apply this UDF to DataFrame columns to perform custom operations not built into Spark.

Syntax

To create a UDF in PySpark, you use the udf function from pyspark.sql.functions. You define a Python function that performs your custom logic, then wrap it with udf specifying the return data type.

Example parts:

  • def my_func(x) – your Python function
  • udf(my_func, returnType) – wraps your function as a Spark UDF
  • returnType – a Spark SQL data type such as StringType() or IntegerType()
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def my_func(x):
    return x.upper()

my_udf = udf(my_func, StringType())
```

Example

This example shows how to create a UDF that converts strings to uppercase and apply it to a DataFrame column.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName('UDF Example').getOrCreate()

def to_upper(s):
    if s is not None:
        return s.upper()
    return None

upper_udf = udf(to_upper, StringType())

data = [('alice',), ('bob',), ('charlie',)]
columns = ['name']
df = spark.createDataFrame(data, columns)

# Apply UDF to 'name' column
result_df = df.withColumn('name_upper', upper_udf(df['name']))
result_df.show()
```
Output:

```
+-------+----------+
|   name|name_upper|
+-------+----------+
|  alice|     ALICE|
|    bob|       BOB|
|charlie|   CHARLIE|
+-------+----------+
```

Common Pitfalls

Common mistakes when creating UDFs in PySpark include:

  • Not specifying the return type, which can cause errors or unexpected results.
  • Using Python functions that are not serializable or rely on external state.
  • Forgetting to handle None (null) values inside the function, which can cause failures.
  • Using UDFs when built-in Spark SQL functions can do the job more efficiently.
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Risky: no return type given, so it defaults to StringType and the
# integer results may come back as strings or nulls
# bad_udf = udf(lambda x: x + 1)

# Right: specify the return type and handle None
correct_udf = udf(lambda x: x + 1 if x is not None else None, IntegerType())
```

Quick Reference

Summary tips for creating UDFs in PySpark:

  • Always import udf and the correct return type from pyspark.sql.types.
  • Define a simple Python function with your logic.
  • Wrap it with udf(your_function, returnType).
  • Apply the UDF to DataFrame columns using withColumn or select.
  • Handle None values inside your function to avoid errors.
  • Prefer built-in Spark functions when possible for better performance.

Key Takeaways

  • Create a UDF by defining a Python function and wrapping it with pyspark.sql.functions.udf, specifying the return type.
  • Always handle None values inside your UDF function to prevent runtime errors.
  • Use UDFs to apply custom logic on DataFrame columns not supported by built-in Spark functions.
  • Specify the correct Spark SQL return type when registering the UDF.
  • Prefer built-in Spark SQL functions over UDFs for better performance when possible.