How to Create UDF in PySpark: Simple Guide with Examples
In PySpark, you create a User Defined Function (UDF) using
pyspark.sql.functions.udf by defining a Python function and registering it with a return type. Then, you can apply this UDF to DataFrame columns to perform custom operations not built into Spark.
Syntax
To create a UDF in PySpark, you use the udf function from pyspark.sql.functions. You define a Python function that performs your custom logic, then wrap it with udf specifying the return data type.
Example parts:
- def my_func(x) – your Python function
- udf(my_func, returnType) – wraps your function as a Spark UDF
- returnType – a Spark SQL data type like StringType(), IntegerType(), etc.
python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def my_func(x):
    return x.upper()

my_udf = udf(my_func, StringType())
Example
This example shows how to create a UDF that converts strings to uppercase and apply it to a DataFrame column.
python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName('UDF Example').getOrCreate()

def to_upper(s):
    if s is not None:
        return s.upper()
    return None

upper_udf = udf(to_upper, StringType())

data = [('alice',), ('bob',), ('charlie',)]
columns = ['name']
df = spark.createDataFrame(data, columns)

# Apply UDF to 'name' column
result_df = df.withColumn('name_upper', upper_udf(df['name']))
result_df.show()
Output
+-------+----------+
| name|name_upper|
+-------+----------+
| alice| ALICE|
| bob| BOB|
|charlie| CHARLIE|
+-------+----------+
Common Pitfalls
Common mistakes when creating UDFs in PySpark include:
- Not specifying the return type, which can cause errors or unexpected results.
- Using Python functions that are not serializable or that rely on external state.
- Forgetting to handle None (null) values inside the function, which can cause failures.
- Using UDFs when built-in Spark SQL functions can do the job more efficiently.
python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Wrong: no return type specified, so Spark assumes StringType
# and the results come back as strings (or fail downstream).
# bad_udf = udf(lambda x: x + 1)

# Right: specify the return type explicitly
correct_udf = udf(lambda x: x + 1 if x is not None else None, IntegerType())
Quick Reference
Summary tips for creating UDFs in PySpark:
- Import udf from pyspark.sql.functions and the return type from pyspark.sql.types.
- Define a simple Python function with your logic.
- Wrap it with udf(your_function, returnType).
- Apply the UDF to DataFrame columns using withColumn or select.
- Handle None values inside your function to avoid errors.
- Prefer built-in Spark functions when possible for better performance.
Key Takeaways
Create a UDF by defining a Python function and wrapping it with pyspark.sql.functions.udf specifying the return type.
Always handle None values inside your UDF function to prevent runtime errors.
Use UDFs to apply custom logic on DataFrame columns not supported by built-in Spark functions.
Specify the correct Spark SQL return type when registering the UDF.
Prefer built-in Spark SQL functions over UDFs for better performance when possible.