How to Use UDF in PySpark: Simple Guide with Examples
In PySpark, you use `udf` to create a User Defined Function that applies custom logic to DataFrame columns. Define a Python function, register it with `udf()`, then use it in DataFrame operations like `select` or `withColumn`.
Syntax
To use a UDF in PySpark, you first define a Python function with your custom logic. Then, you register this function as a UDF using pyspark.sql.functions.udf, optionally specifying the return type. Finally, you apply the UDF to DataFrame columns.
- Define function: write your Python logic.
- Register UDF: use `udf(your_function, returnType)`.
- Apply UDF: use it in `select` or `withColumn`.
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def your_function(value):
    # your custom logic here
    return str(value).upper()

# Register UDF with return type
your_udf = udf(your_function, StringType())

# Use in DataFrame
# df.select(your_udf(df['column_name']))
```
Example
This example shows how to create a UDF that converts strings to uppercase and apply it to a DataFrame column.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.master('local[*]').appName('UDF Example').getOrCreate()

data = [('alice',), ('bob',), ('charlie',)]
df = spark.createDataFrame(data, ['name'])

def to_upper(name):
    return name.upper() if name else None

upper_udf = udf(to_upper, StringType())

# Apply UDF to create new column
result_df = df.withColumn('name_upper', upper_udf(df['name']))
result_df.show()
```
Output
```
+-------+----------+
|   name|name_upper|
+-------+----------+
|  alice|     ALICE|
|    bob|       BOB|
|charlie|   CHARLIE|
+-------+----------+
```
Common Pitfalls
Common mistakes when using UDFs in PySpark include:
- Not specifying the return type: `udf()` defaults to `StringType`, so non-string results can silently come back as null or stringified values.
- Using Python functions that are not serializable or depend on external variables.
- Applying UDFs on large datasets without considering performance impact, as UDFs run row-by-row and can be slower than built-in functions.
Always prefer built-in Spark SQL functions when possible for better performance.
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def str_length(x):
    return len(x) if x is not None else None

# Wrong: no return type -- defaults to StringType, so the integer
# results may come back as strings or nulls depending on the version
bad_udf = udf(str_length)

# Right: declare the actual return type
good_udf = udf(str_length, IntegerType())
```
Quick Reference
| Step | Description | Example |
|---|---|---|
| Define Python function | Write your custom logic | def f(x): return x*2 |
| Register UDF | Wrap function with udf() and specify return type | udf_f = udf(f, IntegerType()) |
| Use UDF | Apply UDF in DataFrame operations | df.withColumn('new_col', udf_f(df['col'])) |
Key Takeaways
- Create a Python function with your custom logic before registering it as a UDF.
- Always specify the return type when registering a UDF to avoid errors.
- Use UDFs with DataFrame methods like withColumn or select to transform data.
- Prefer built-in Spark functions over UDFs for better performance when possible.
- UDFs run row-by-row and can slow down processing on large datasets.