UDFs (User Defined Functions) in Apache Spark - Time & Space Complexity
When using UDFs in Apache Spark, it's important to understand how execution time grows as data size increases: specifically, how the cost of applying a UDF changes as the number of rows grows.
Analyze the time complexity of the following code snippet.
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def square(x):
    return x * x

# Register the Python function as a Spark UDF that returns integers
square_udf = udf(square, IntegerType())

# DataFrame with one million rows (column "id": 0 .. 999999)
df = spark.range(0, 1000000)
df_with_square = df.withColumn('square', square_udf(df['id']))
```
This code defines a UDF to square numbers and applies it to each row in a Spark DataFrame.
Identify the loops, recursion, or array traversals that determine how often work repeats.
- Primary operation: Applying the UDF function to each row in the DataFrame.
- How many times: Once per row, so as many times as there are rows (n times).
Each row requires one call to the UDF, so the total work grows directly with the number of rows.
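This per-row pattern can be sketched in plain Python (no Spark needed; a counting wrapper is an illustrative stand-in for Spark's actual execution path, not how Spark itself instruments UDFs):

```python
# Illustrative sketch (plain Python, no Spark): wrapping the squaring
# function with a counter shows exactly one call per row, i.e. n calls
# for n rows.
def square(x):
    return x * x

call_count = 0

def counting_square(x):
    global call_count
    call_count += 1
    return square(x)

rows = range(1000)                              # stand-in for DataFrame rows
squares = [counting_square(r) for r in rows]    # one call per row

print(call_count)   # 1000 -- one UDF-style call per row
```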
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 UDF calls |
| 100 | 100 UDF calls |
| 1000 | 1000 UDF calls |
Pattern observation: Doubling the number of rows doubles the number of UDF calls.
Time Complexity: O(n)
This means the time to run the UDF grows linearly with the number of rows in the DataFrame.
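The doubling pattern from the table can be checked with a tiny sketch (again plain Python, standing in for Spark's per-row application):

```python
# Sketch confirming linear growth: n rows -> n UDF calls, so doubling
# the row count doubles the number of calls (O(n)).
def udf_calls_for(n_rows):
    calls = 0
    for _ in range(n_rows):   # one UDF application per row
        calls += 1
    return calls

small = udf_calls_for(1000)
large = udf_calls_for(2000)
print(small, large)   # 1000 2000 -- twice the rows, twice the calls
```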
[X] Wrong: "UDFs run once and apply to all rows at the same time, so time does not grow with data size."
[OK] Correct: Each row must be processed individually by the UDF, so more rows mean more calls and more time.
Understanding UDF time complexity helps you explain performance impacts in Spark jobs and shows you can reason about scaling data operations.
What if we replaced the UDF with a built-in Spark function? How would the time complexity change?