Challenge - 5 Problems
UDF Mastery Badge
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00
Output of a simple UDF application
What is the output of this Spark code that uses a UDF to square numbers in a DataFrame?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.master('local').appName('UDF Test').getOrCreate()
data = [(1,), (2,), (3,)]
df = spark.createDataFrame(data, ['num'])
square_udf = udf(lambda x: x * x, IntegerType())
df2 = df.withColumn('square', square_udf(df['num']))
df2.show()
💡 Hint
Think about what the lambda function does to each number.
The UDF squares each number in the 'num' column, so the 'square' column contains the squares: 1, 4, and 9.
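Outside Spark, the per-row logic of this UDF is just the lambda itself; a quick plain-Python check (no SparkSession needed) confirms the values the 'square' column will hold:

```python
# The same lambda Spark applies to each row of the 'num' column.
square = lambda x: x * x

rows = [1, 2, 3]
squares = [square(x) for x in rows]
print(squares)  # [1, 4, 9] — the values shown in the 'square' column
```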
❓ Predict Output
Intermediate · 2:00
Result of UDF with string manipulation
Given this Spark code with a UDF that reverses strings, what is the resulting DataFrame output?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.master('local').appName('UDF String').getOrCreate()
data = [('apple',), ('banana',), ('cherry',)]
df = spark.createDataFrame(data, ['fruit'])
reverse_udf = udf(lambda s: s[::-1], StringType())
df2 = df.withColumn('reversed', reverse_udf(df['fruit']))
df2.show()
💡 Hint
The lambda reverses the string using slicing.
The UDF reverses each string in the 'fruit' column, so the 'reversed' column shows the reversed fruit names.
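The per-row transformation is ordinary Python slicing with a step of -1, which walks the string backwards; you can verify the reversed values without Spark:

```python
# The same lambda Spark applies to each row of the 'fruit' column.
reverse = lambda s: s[::-1]

fruits = ['apple', 'banana', 'cherry']
print([reverse(f) for f in fruits])  # ['elppa', 'ananab', 'yrrehc']
```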
🔧 Debug
Advanced · 2:00
Identify the error in this UDF code
What is wrong with how this Spark code defines and applies the UDF?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.master('local').appName('UDF Error').getOrCreate()
data = [(1,), (2,), (3,)]
df = spark.createDataFrame(data, ['num'])

def add_five(x):
    return x + 5

add_five_udf = udf(add_five)
df2 = df.withColumn('plus_five', add_five_udf(df['num']))
df2.show()
💡 Hint
Check how the UDF is defined and registered.
In PySpark, omitting the return type does not raise an AnalysisException: udf() defaults to StringType, so add_five still runs, but the 'plus_five' column comes back as strings ('6', '7', '8') rather than integers. Pass IntegerType() as the second argument to udf() to get the intended numeric column.
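A plain-Python sketch of the effect (a mimic of the default-StringType cast, not Spark's actual internals): the function works per row, but each result ends up as a string.

```python
def add_five(x):
    return x + 5

# With udf(add_five) and no return type, Spark assumes StringType and
# casts each result to a string; mimicking that cast per row:
results = [str(add_five(x)) for x in [1, 2, 3]]
print(results)  # ['6', '7', '8'] — strings, not the integers you wanted

# Registering with an explicit return type keeps the values numeric:
# add_five_udf = udf(add_five, IntegerType())
```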
🚀 Application
Advanced · 2:30
Using UDF to categorize ages
You want to add a new column 'age_group' to a Spark DataFrame with ages, categorizing ages as 'child' (<18), 'adult' (18-64), or 'senior' (65+). Which UDF definition and usage will produce the correct output?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.master('local').appName('Age Group').getOrCreate()
data = [(5,), (30,), (70,)]
df = spark.createDataFrame(data, ['age'])
💡 Hint
Check the age boundaries carefully for each group.
The correct categorization is: child if age < 18, adult if 18 <= age <= 64, senior if age >= 65. Option B matches this logic.
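One UDF body matching the stated boundaries (a sketch of the per-row logic; in the Spark code above you would wrap it with udf(age_group, StringType()) before calling withColumn):

```python
def age_group(age):
    # Boundaries from the problem statement:
    # < 18 -> 'child', 18-64 -> 'adult', 65+ -> 'senior'
    if age < 18:
        return 'child'
    elif age <= 64:
        return 'adult'
    else:
        return 'senior'

print([age_group(a) for a in [5, 30, 70]])  # ['child', 'adult', 'senior']
```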
🧠 Conceptual
Expert · 1:30
Why avoid UDFs when possible in Spark?
Which is the main reason to avoid using UDFs in Spark when native functions exist?
💡 Hint
Think about how Spark optimizes queries and executes code.
UDFs are opaque to Spark's Catalyst optimizer, so queries that use them cannot be fully optimized; Python UDFs also add serialization overhead between the JVM and Python workers. When a native expression exists (for example, df['num'] * df['num'] instead of a squaring UDF), it runs inside the optimized engine and is typically much faster.