
UDFs (User Defined Functions) in Apache Spark - Step-by-Step Execution

Concept Flow - UDFs (User Defined Functions)
Define UDF function
Register UDF with Spark
Apply UDF to DataFrame column
Spark runs UDF on each row
New column with UDF results
Show or use transformed DataFrame
You first write a plain Python function, then register it with Spark as a UDF. Spark applies it to each row, creating a new column with the results.
Execution Sample
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

def greet(name):
    return f"Hello, {name}!"

# Register the function as a UDF, declaring its return type
greet_udf = udf(greet, StringType())

df = spark.createDataFrame([('Alice',), ('Bob',)], ['name'])
df2 = df.withColumn('greeting', greet_udf(df.name))
df2.show()
This code defines a UDF that adds a greeting to each name in the DataFrame.
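Running the sample in a PySpark session should print output along these lines (show() right-aligns values; exact widths depend on the data):

```
+-----+-------------+
| name|     greeting|
+-----+-------------+
|Alice|Hello, Alice!|
|  Bob|  Hello, Bob!|
+-----+-------------+
```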
Execution Table
Step | Action | Input | UDF Output | DataFrame State
1 | Define function greet | name string | N/A | No DataFrame change
2 | Register greet as UDF | greet function | UDF object created | No DataFrame change
3 | Create DataFrame | [('Alice',), ('Bob',)] | N/A | DataFrame with column 'name' and 2 rows
4 | Apply UDF to 'name' column | Row 1: 'Alice' | 'Hello, Alice!' | New column 'greeting' added to row 1
5 | Apply UDF to 'name' column | Row 2: 'Bob' | 'Hello, Bob!' | New column 'greeting' added to row 2
6 | Show DataFrame | N/A | N/A | DataFrame shows columns 'name' and 'greeting' with values
💡 All rows processed: the UDF was applied to each, and the DataFrame was transformed with the new column.
Variable Tracker
Variable | Start | After Step 2 | After Step 3 | After Step 6
greet | Function not defined | Function defined | Function defined | Function defined
greet_udf | None | UDF object created | UDF object created | UDF object created
df | None | None | DataFrame with 2 rows and 'name' column | DataFrame unchanged
df2 | None | None | None | DataFrame with 'name' and 'greeting' columns
Key Moments - 3 Insights
Why do we need to register the Python function as a UDF before using it in Spark?
Spark runs your code distributed across many machines and needs to know how to apply your Python function to its data. Registering it as a UDF wraps the function so Spark can ship it to executors and run it on each row's value, as shown in execution table steps 2 and 4.
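As a rough mental model (a toy sketch, not Spark's actual implementation), registration wraps your plain function in an object that knows how to apply it to every value in a column:

```python
# Toy stand-in for Spark's udf(): wraps a plain Python function so an
# engine can apply it element-wise across a column of values.
def make_udf(fn):
    def apply_to_column(column_values):
        return [fn(v) for v in column_values]
    return apply_to_column

def greet(name):
    return f"Hello, {name}!"

greet_udf = make_udf(greet)
print(greet_udf(["Alice", "Bob"]))  # ['Hello, Alice!', 'Hello, Bob!']
```

The real udf() wrapper additionally carries the declared return type and the serialized function, so Spark can send it to executor processes.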
Does the UDF change the original DataFrame or create a new one?
The UDF does not change the original DataFrame. Spark DataFrames are immutable, so withColumn returns a new DataFrame with the added column, as seen in execution table step 6 where df2 has the new 'greeting' column while df still has only 'name'.
What happens if the UDF is applied to a null or missing value?
If the input value is null, a Python UDF receives None, and your function must handle it explicitly; otherwise it may return a misleading result or raise an error. Handling None up front avoids surprises during execution.
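For example, the greet function from the sample would happily produce the string 'Hello, None!' for a null name. A null-aware version (still registered with udf(..., StringType()) exactly as before) could guard explicitly:

```python
def greet(name):
    # Column nulls arrive in a Python UDF as None; guard explicitly,
    # otherwise the f-string would return the string "Hello, None!".
    if name is None:
        return None
    return f"Hello, {name}!"

print(greet(None))     # None -> Spark stores a null in the new column
print(greet("Alice"))  # Hello, Alice!
```

Returning None from the function makes Spark write a proper null into the result column instead of a garbage string.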
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, what is the UDF output when the input name is 'Bob'?
A. 'Hello, Bob!'
B. 'Hello, Alice!'
C. 'Bob'
D. None
💡 Hint
Check execution table row 5, where the input is 'Bob', and read the UDF Output column.
At which step does the DataFrame get a new column added?
A. Step 3
B. Step 6
C. Step 4
D. Step 2
💡 Hint
Look at execution table rows 4 and 5, where the UDF is applied and new column values appear.
If we skip registering the function as a UDF, what will happen when applying it to the DataFrame?
A. Spark will apply the function automatically
B. Spark will throw an error because it doesn't recognize the function
C. The DataFrame will remain unchanged
D. The function will run but the output will be null
💡 Hint
Refer to the Key Moments question about why registering as a UDF is necessary.
Concept Snapshot
UDFs let you run your own Python code on Spark DataFrames.
Define a Python function.
Register it as a UDF with a return type.
Apply it to DataFrame columns.
Spark runs it on each row, creating new columns.
Original DataFrame stays unchanged.
Full Transcript
User Defined Functions (UDFs) in Apache Spark allow you to apply your own Python functions to DataFrame columns. First, you define a Python function that does what you want, like adding a greeting to a name. Then, you register this function as a UDF with Spark, telling it the return type. After that, you apply the UDF to a DataFrame column. Spark runs your function on each row's value and creates a new column with the results. The original DataFrame does not change; instead, a new DataFrame with the added column is created. This process lets you customize data processing beyond built-in Spark functions.