
Unit Testing Spark Transformations - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output (intermediate)
Output of a Spark DataFrame filter transformation
What is the output of the following Spark code snippet?
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('test').getOrCreate()
data = [(1, 'apple'), (2, 'banana'), (3, 'carrot')]
df = spark.createDataFrame(data, ['id', 'fruit'])
filtered_df = df.filter(df.id > 1)
result = filtered_df.collect()
A. [Row(id=1, fruit='apple')]
B. [Row(id=1, fruit='apple'), Row(id=2, fruit='banana')]
C. [Row(id=2, fruit='banana'), Row(id=3, fruit='carrot')]
D. []
💡 Hint
Remember filter keeps rows where the condition is true.
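Outside of a Spark session, filter's row-keeping semantics can be mimicked with a plain Python list comprehension. This is a sketch of the behavior only, not PySpark itself; the dicts below are stand-ins for Spark Row objects:

```python
# Plain-Python sketch of DataFrame.filter semantics: keep only the rows
# where the predicate evaluates to true. Dicts stand in for Row objects.
data = [{"id": 1, "fruit": "apple"},
        {"id": 2, "fruit": "banana"},
        {"id": 3, "fruit": "carrot"}]

# Analogue of df.filter(df.id > 1).collect()
filtered = [row for row in data if row["id"] > 1]
print(filtered)  # the rows with id 2 and 3 survive; id 1 is dropped
```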
Data Output (intermediate)
Result of a Spark DataFrame groupBy and count
What is the output of this Spark code after grouping and counting?
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('test').getOrCreate()
data = [('red', 1), ('blue', 2), ('red', 3), ('blue', 4), ('green', 5)]
df = spark.createDataFrame(data, ['color', 'value'])
grouped_df = df.groupBy('color').count().orderBy('color')
result = grouped_df.collect()
A. [Row(color='blue', count=2), Row(color='green', count=1), Row(color='red', count=2)]
B. [Row(color='red', count=3), Row(color='blue', count=2), Row(color='green', count=1)]
C. [Row(color='blue', count=1), Row(color='green', count=1), Row(color='red', count=1)]
D. [Row(color='red', count=2), Row(color='blue', count=3), Row(color='green', count=1)]
💡 Hint
Count how many times each color appears.
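The groupBy/count/orderBy pipeline can be mimicked in plain Python with `collections.Counter` plus a sort. A minimal sketch of those semantics (not PySpark):

```python
from collections import Counter

# Plain-Python sketch of df.groupBy('color').count(): tally rows per key.
data = [("red", 1), ("blue", 2), ("red", 3), ("blue", 4), ("green", 5)]
counts = Counter(color for color, _ in data)

# Analogue of .orderBy('color'): sort the grouped rows by the key.
ordered = sorted(counts.items())
print(ordered)  # [('blue', 2), ('green', 1), ('red', 2)]
```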
🔧 Debug (advanced)
Identify the error in Spark DataFrame join code
What error will this Spark code raise when executed?
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('test').getOrCreate()
data1 = [(1, 'a'), (2, 'b')]
data2 = [(1, 'x'), (3, 'y')]
df1 = spark.createDataFrame(data1, ['id', 'val1'])
df2 = spark.createDataFrame(data2, ['id', 'val2'])
joined_df = df1.join(df2, on='ID')
result = joined_df.collect()
A. AnalysisException: cannot resolve '`ID`' given input columns: [id, val1]
B. No error; returns joined rows on id
C. TypeError: join() missing required positional argument 'on'
D. ValueError: join key must be a list or string
💡 Hint
Check column name case sensitivity in join keys.
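One way to catch this class of bug before calling join is to validate the key against each frame's column list up front. The `validate_join_key` helper below is hypothetical, not part of the PySpark API, and it checks names with exact case; note that whether Spark itself treats 'ID' and 'id' as the same name is governed by the spark.sql.caseSensitive setting:

```python
# Hypothetical pre-flight check for a join key; not part of the PySpark API.
def validate_join_key(key, left_columns, right_columns):
    """Raise KeyError if `key` is not an exact column name on both sides."""
    for side, columns in (("left", left_columns), ("right", right_columns)):
        if key not in columns:
            raise KeyError(
                f"cannot resolve '{key}' on the {side} side; "
                f"available columns: {columns}")

left = ["id", "val1"]
right = ["id", "val2"]

try:
    validate_join_key("ID", left, right)  # wrong case: 'ID' vs 'id'
except KeyError as exc:
    message = str(exc)
```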
Visualization (advanced)
Visualizing missing data counts in a Spark DataFrame
Which code snippet correctly computes the count of missing (null) values per column in a Spark DataFrame?
A. df.select([F.count(F.when(F.col(c) == None, c)).alias(c) for c in df.columns])
B. df.select([F.count(F.col(c).isNull()).alias(c) for c in df.columns])
C. df.select([F.sum(F.col(c).isNull()).alias(c) for c in df.columns])
D. df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns])
💡 Hint
Use when() to filter nulls and count() to count them.
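The key idea behind the when()/count() pattern is that count() ignores nulls, so counting an expression that is non-null only *when* the original value is null yields the number of missing entries (comparing a column to None with `==` yields null in Spark SQL, never true, which is why isNull() matters). A plain-Python sketch of the same per-column tally, not PySpark:

```python
# Plain-Python sketch of per-column null counting. Dicts stand in for rows.
rows = [{"a": 1,    "b": None},
        {"a": None, "b": None},
        {"a": 3,    "b": "x"}]
columns = ["a", "b"]

# Analogue of F.count(F.when(F.col(c).isNull(), c)) per column: count only
# the rows where the value is missing.
null_counts = {c: sum(1 for row in rows if row[c] is None) for c in columns}
print(null_counts)  # {'a': 1, 'b': 2}
```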
🚀 Application (expert)
Unit test for a Spark transformation function
Given a Spark transformation function that adds a new column 'double_value' by doubling an existing 'value' column, which unit test correctly verifies the transformation?
def add_double_value_column(df):
    from pyspark.sql.functions import col
    return df.withColumn('double_value', col('value') * 2)

# Unit test code options below
A.
def test_add_double_value_column(spark):
    data = [(1,), (2,), (3,)]
    df = spark.createDataFrame(data, ['value'])
    result_df = add_double_value_column(df)
    expected = [2, 4, 6]
    actual = [row.value for row in result_df.collect()]
    assert actual == expected
B.
def test_add_double_value_column(spark):
    data = [(1,), (2,), (3,)]
    df = spark.createDataFrame(data, ['value'])
    result_df = add_double_value_column(df)
    expected = [2, 4, 6]
    actual = [row.double_value for row in result_df.collect()]
    assert actual == expected
C.
def test_add_double_value_column(spark):
    data = [(1,), (2,), (3,)]
    df = spark.createDataFrame(data, ['value'])
    result_df = add_double_value_column(df)
    expected = [1, 2, 3]
    actual = [row.double_value for row in result_df.collect()]
    assert actual == expected
D.
def test_add_double_value_column(spark):
    data = [(1,), (2,), (3,)]
    df = spark.createDataFrame(data, ['value'])
    result_df = add_double_value_column(df)
    expected = [2, 4, 6]
    actual = [row.double_value for row in df.collect()]
    assert actual == expected
💡 Hint
Check that the new column values are double the original values.
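The testing pattern at stake here applies to any pure transformation: build a small input, apply the function, and assert on the *new* column of the *returned* frame (not the original input, and not the original column). A plain-Python sketch of that shape, with a list-of-dicts frame standing in for a DataFrame:

```python
# Plain-Python stand-in for the transformation under test: returns a new
# "frame" with a 'double_value' field computed from 'value'.
def add_double_value_column(rows):
    return [{**row, "double_value": row["value"] * 2} for row in rows]

def test_add_double_value_column():
    rows = [{"value": 1}, {"value": 2}, {"value": 3}]
    result = add_double_value_column(rows)
    # Assert on the new column of the returned frame, not the input frame.
    actual = [row["double_value"] for row in result]
    assert actual == [2, 4, 6]

test_add_double_value_column()
```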