
How to Use UDF in PySpark: Simple Guide with Examples

In PySpark, you use udf to create a User Defined Function that applies custom logic to DataFrame columns. Define a Python function, register it with udf(), then use it in DataFrame operations like select or withColumn.
📝

Syntax

To use a UDF in PySpark, you first define a Python function with your custom logic. Then, you register this function as a UDF using pyspark.sql.functions.udf, optionally specifying the return type. Finally, you apply the UDF to DataFrame columns.

  • Define function: Your Python logic.
  • Register UDF: Use udf(your_function, returnType).
  • Apply UDF: Use in select or withColumn.
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def your_function(value):
    # your custom logic here
    return str(value).upper()

# Register UDF with return type
your_udf = udf(your_function, StringType())

# Use in DataFrame
# df.select(your_udf(df['column_name']))
```
💻

Example

This example shows how to create a UDF that converts strings to uppercase and apply it to a DataFrame column.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.master('local[*]').appName('UDF Example').getOrCreate()

data = [('alice',), ('bob',), ('charlie',)]
df = spark.createDataFrame(data, ['name'])

def to_upper(name):
    return name.upper() if name else None

upper_udf = udf(to_upper, StringType())

# Apply UDF to create new column
result_df = df.withColumn('name_upper', upper_udf(df['name']))

result_df.show()
```
Output

```
+-------+----------+
|   name|name_upper|
+-------+----------+
|  alice|     ALICE|
|    bob|       BOB|
|charlie|   CHARLIE|
+-------+----------+
```
⚠️

Common Pitfalls

Common mistakes when using UDFs in PySpark include:

  • Not specifying the return type, which can cause errors or unexpected results.
  • Using Python functions that are not serializable or depend on external variables.
  • Applying UDFs on large datasets without considering performance impact, as UDFs run row-by-row and can be slower than built-in functions.

Always prefer built-in Spark SQL functions when possible for better performance.

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def lower_case(x):
    return x.lower() if x else None  # guard against None values

# Wrong: no return type specified; Spark falls back to StringType,
# which can silently produce wrong results for non-string returns
bad_udf = udf(lower_case)

# Right: specify the return type explicitly
good_udf = udf(lower_case, StringType())
```
📊

Quick Reference

| Step | Description | Example |
| --- | --- | --- |
| Define Python function | Write your custom logic | `def f(x): return x * 2` |
| Register UDF | Wrap function with `udf()` and specify return type | `udf_f = udf(f, IntegerType())` |
| Use UDF | Apply UDF in DataFrame operations | `df.withColumn('new_col', udf_f(df['col']))` |
✅

Key Takeaways

  • Create a Python function with your custom logic before registering it as a UDF.
  • Always specify the return type when registering a UDF to avoid errors.
  • Use UDFs with DataFrame methods like withColumn or select to transform data.
  • Prefer built-in Spark functions over UDFs for better performance when possible.
  • UDFs run row-by-row and can slow down processing on large datasets.