
UDFs (User Defined Functions) in Apache Spark - Practice Problems & Coding Challenges

Challenge: 5 Problems
🎖️ UDF Mastery Badge: answer all five challenges correctly to earn it.
Problem 1: Predict Output (intermediate)
Output of a simple UDF application
What is the output of this Spark code that uses a UDF to square numbers in a DataFrame?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.master('local').appName('UDF Test').getOrCreate()
data = [(1,), (2,), (3,)]
df = spark.createDataFrame(data, ['num'])

square_udf = udf(lambda x: x * x, IntegerType())
df2 = df.withColumn('square', square_udf(df['num']))
df2.show()
A
+---+------+
|num|square|
+---+------+
|  1|     1|
|  2|     4|
|  3|     9|
+---+------+
B
TypeError: 'int' object is not callable
C
SyntaxError: invalid syntax
D
+---+------+
|num|square|
+---+------+
|  1|     2|
|  2|     4|
|  3|     6|
+---+------+
💡 Hint
Think about what the lambda function does to each number.
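If it helps, the lambda a UDF wraps is just an ordinary Python function, so you can probe it directly before reasoning about the DataFrame (a quick sketch, independent of Spark):

```python
# The same kind of lambda the UDF wraps; calling it directly in
# plain Python shows what Spark applies to each row value.
square = lambda x: x * x

print(square(4))  # → 16
print(square(7))  # → 49
```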
Problem 2: Data Output (intermediate)
Result of UDF with string manipulation
Given this Spark code with a UDF that reverses strings, what is the resulting DataFrame output?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.master('local').appName('UDF String').getOrCreate()
data = [('apple',), ('banana',), ('cherry',)]
df = spark.createDataFrame(data, ['fruit'])

reverse_udf = udf(lambda s: s[::-1], StringType())
df2 = df.withColumn('reversed', reverse_udf(df['fruit']))
df2.show()
A
+------+--------+
| fruit|reversed|
+------+--------+
| apple|   elppa|
|banana|  ananab|
|cherry|  yrrehc|
+------+--------+
B
+------+--------+
| fruit|reversed|
+------+--------+
| apple|   apple|
|banana|  banana|
|cherry|  cherry|
+------+--------+
C
TypeError: 'NoneType' object is not subscriptable
D
SyntaxError: invalid syntax
💡 Hint
The lambda reverses the string using slicing.
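To see the slicing idiom in isolation, try it on a word that is not in the problem's data (plain Python, no Spark needed):

```python
# s[::-1] is Python's slice-reversal idiom: a step of -1 walks
# the string backwards, producing a reversed copy.
word = "spark"
print(word[::-1])  # → "kraps"
```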
Problem 3: 🔧 Debug (advanced)
Identify the error in this UDF code
What error will this Spark code raise when trying to apply the UDF?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.master('local').appName('UDF Error').getOrCreate()
data = [(1,), (2,), (3,)]
df = spark.createDataFrame(data, ['num'])

def add_five(x):
    return x + 5

add_five_udf = udf(add_five)
df2 = df.withColumn('plus_five', add_five_udf(df['num']))
df2.show()
A
SyntaxError: invalid syntax
B
TypeError: add_five() missing 1 required positional argument: 'x'
C
+---+---------+
|num|plus_five|
+---+---------+
|  1|        6|
|  2|        7|
|  3|        8|
+---+---------+
D
AnalysisException: UDF return type must be specified
💡 Hint
Check how the UDF is defined and registered.
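Two things worth recalling while you decide: in PySpark, `udf(f)` with no explicit return type defaults to `StringType()`, and the wrapped Python function still runs as ordinary Python. You can check the plain function in isolation (Spark not required):

```python
# The same function the problem registers as a UDF.
def add_five(x):
    return x + 5

# Called directly, the function is perfectly valid Python.
print(add_five(1))  # → 6

# Note: PySpark's udf() defaults its returnType to StringType(),
# so omitting the type is not itself an error.
```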
Problem 4: 🚀 Application (advanced)
Using UDF to categorize ages
You want to add a new column 'age_group' to a Spark DataFrame with ages, categorizing ages as 'child' (<18), 'adult' (18-64), or 'senior' (65+). Which UDF definition and usage will produce the correct output?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.master('local').appName('Age Group').getOrCreate()
data = [(5,), (30,), (70,)]
df = spark.createDataFrame(data, ['age'])
A
age_group_udf = udf(lambda age: 'child' if age < 18 else 'adult' if age < 65 else 'senior', StringType())
df2 = df.withColumn('age_group', age_group_udf(df['age']))
df2.show()
B
age_group_udf = udf(lambda age: 'child' if age < 18 else 'adult' if age <= 64 else 'senior', StringType())
df2 = df.withColumn('age_group', age_group_udf(df['age']))
df2.show()
C
age_group_udf = udf(lambda age: 'child' if age <= 18 else 'adult' if age <= 65 else 'senior', StringType())
df2 = df.withColumn('age_group', age_group_udf(df['age']))
df2.show()
D
age_group_udf = udf(lambda age: 'child' if age <= 18 else 'adult' if age < 65 else 'senior', StringType())
df2 = df.withColumn('age_group', age_group_udf(df['age']))
df2.show()
df2 = df.withColumn('age_group', age_group_udf(df['age']))
df2.show()
💡 Hint
Check the age boundaries carefully for each group.
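A plain-Python way to vet any candidate lambda before wrapping it in a udf is to probe the boundary ages directly (18 and 65 are the edges named in the problem; the candidate below is a deliberately imperfect example, not necessarily one of the four options):

```python
# Expected categories at the boundary ages, straight from the
# problem statement: <18 is 'child', 18-64 is 'adult', 65+ is 'senior'.
expected = {17: 'child', 18: 'adult', 64: 'adult', 65: 'senior'}

def check(candidate):
    # Return the boundary ages where the candidate disagrees with the spec.
    return [age for age in expected if candidate(age) != expected[age]]

# A candidate that uses <= on both edges misclassifies two boundaries:
candidate = lambda age: 'child' if age <= 18 else 'adult' if age <= 65 else 'senior'
print(check(candidate))  # → [18, 65]
```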
Problem 5: 🧠 Conceptual (expert)
Why avoid UDFs when possible in Spark?
Which is the main reason to avoid using UDFs in Spark when native functions exist?
A
UDFs cannot handle null values, causing runtime exceptions.
B
UDFs always cause syntax errors in Spark SQL queries.
C
UDFs are slower because they are opaque to Spark's Catalyst optimizer and add serialization overhead between the JVM and Python, causing performance loss.
D
UDFs automatically convert data to Pandas DataFrames, which is inefficient.
💡 Hint
Think about how Spark optimizes queries and executes code.