How to Use UDF in PySpark: Simple Guide with Examples
In PySpark, you use `udf` to create a User Defined Function that applies custom logic to DataFrame columns. Define a Python function, register it with `udf()`, then use it in DataFrame operations like `select` or `withColumn`.
Syntax
To use a UDF in PySpark, you first define a Python function with your custom logic. Then, you register this function as a UDF using pyspark.sql.functions.udf, optionally specifying the return type. Finally, you apply the UDF to DataFrame columns.
- Define function: write your Python logic.
- Register UDF: use `udf(your_function, returnType)`.
- Apply UDF: use it in `select` or `withColumn`.
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def your_function(value):
    # your custom logic here
    return str(value).upper()

# Register UDF with return type
your_udf = udf(your_function, StringType())

# Use in DataFrame
# df.select(your_udf(df['column_name']))
```
Example
This example shows how to create a UDF that converts strings to uppercase and apply it to a DataFrame column.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.master('local[*]').appName('UDF Example').getOrCreate()

data = [('alice',), ('bob',), ('charlie',)]
df = spark.createDataFrame(data, ['name'])

def to_upper(name):
    return name.upper() if name else None

upper_udf = udf(to_upper, StringType())

# Apply UDF to create new column
result_df = df.withColumn('name_upper', upper_udf(df['name']))
result_df.show()
```
Output
```
+-------+----------+
|   name|name_upper|
+-------+----------+
|  alice|     ALICE|
|    bob|       BOB|
|charlie|   CHARLIE|
+-------+----------+
```
Common Pitfalls
Common mistakes when using UDFs in PySpark include:
- Not specifying the return type: `udf()` defaults to `StringType`, so non-string results can silently come back as null or stringified values.
- Using Python functions that are not serializable or depend on external variables.
- Applying UDFs on large datasets without considering performance impact, as UDFs run row-by-row and can be slower than built-in functions.
Always prefer built-in Spark SQL functions when possible for better performance.
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def str_length(x):
    return len(x) if x is not None else None

# Wrong: no return type -- defaults to StringType, so the integer
# results may come back as strings or nulls depending on the version
bad_udf = udf(str_length)

# Right: declare the actual return type
good_udf = udf(str_length, IntegerType())
```
Quick Reference
| Step | Description | Example |
|---|---|---|
| Define Python function | Write your custom logic | def f(x): return x*2 |
| Register UDF | Wrap function with udf() and specify return type | udf_f = udf(f, IntegerType()) |
| Use UDF | Apply UDF in DataFrame operations | df.withColumn('new_col', udf_f(df['col'])) |
Key Takeaways
- Create a Python function with your custom logic before registering it as a UDF.
- Always specify the return type when registering a UDF to avoid errors.
- Use UDFs with DataFrame methods like withColumn or select to transform data.
- Prefer built-in Spark functions over UDFs for better performance when possible.
- UDFs run row-by-row and can slow down processing on large datasets.