User-defined functions with Snowpark in Snowflake - Time & Space Complexity
When using user-defined functions (UDFs) with Snowpark, it is important to understand how execution time scales as the amount of data grows. The key question is how the number of times the UDF runs determines the total work done.
Analyze the time complexity of applying a UDF to a Snowflake table column.
```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import udf

# Connection parameters elided; supply your own account configuration.
session = Session.builder.configs({...}).create()

@udf
def add_one(x: int) -> int:
    return x + 1

df = session.table("numbers")
df = df.select(add_one(df["value"]).alias("value_plus_one"))
df.collect()
```
This code defines a simple UDF that adds one to a number, applies it to each row in the "numbers" table, and collects the results.
Look at what happens repeatedly when this code runs.
- Primary operation: The UDF is called once for each row in the table.
- How many times: Equal to the number of rows in the "numbers" table.
As the number of rows increases, the UDF runs more times, directly matching the row count.
| Input Size (n) | Approx. UDF Calls |
|---|---|
| 10 | 10 UDF calls |
| 100 | 100 UDF calls |
| 1000 | 1000 UDF calls |
Pattern observation: The number of UDF calls grows directly with the number of rows.
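The pattern in the table can be demonstrated with a small local simulation. This is a hypothetical stand-in for Snowflake's per-row execution, not actual Snowpark code: a plain Python function plays the role of the UDF, and a counter records how many times it is invoked.

```python
call_count = 0

def add_one(x: int) -> int:
    # Local stand-in for the Snowpark UDF; counts each invocation.
    global call_count
    call_count += 1
    return x + 1

for n in (10, 100, 1000):
    call_count = 0
    rows = list(range(n))                 # a "table" with n rows
    result = [add_one(v) for v in rows]   # one UDF call per row
    print(n, call_count)                  # call count equals row count
```

Each pass confirms the observation: processing n rows produces exactly n UDF calls.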
Time Complexity: O(n)
This means the total work grows linearly with the number of rows processed: doubling the rows doubles the number of UDF calls.
[X] Wrong: "The UDF runs only once regardless of data size."
[OK] Correct: The UDF is applied to each row separately, so it runs as many times as there are rows.
Understanding how UDFs scale with data size lets you predict performance and design efficient data processing tasks.
"What if the UDF was applied only to a filtered subset of rows? How would the time complexity change?"
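One way to reason about this question, again using a local stand-in rather than a live Snowpark session: if a filter keeps only k of the n rows and is evaluated before the UDF (Snowflake's optimizer typically pushes filters below projections), the UDF runs k times, so the work becomes O(k) — still linear, but in the filtered row count.

```python
call_count = 0

def add_one(x: int) -> int:
    # Local stand-in for the UDF; counts each invocation.
    global call_count
    call_count += 1
    return x + 1

rows = list(range(1000))                      # n = 1000 rows
filtered = [v for v in rows if v % 10 == 0]   # filter first: k = 100 rows survive
result = [add_one(v) for v in filtered]       # UDF runs only k times, not n
print(call_count)                             # call count equals filtered row count
```

In Snowpark terms, this corresponds to calling `df.filter(...)` before the `select` that applies the UDF.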