Challenge - 5 Problems
Spark Transformation Testing Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
intermediate · 2:00 remaining
Output of a Spark DataFrame filter transformation
What is the output of the following Spark code snippet?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('test').getOrCreate()
data = [(1, 'apple'), (2, 'banana'), (3, 'carrot')]
df = spark.createDataFrame(data, ['id', 'fruit'])
filtered_df = df.filter(df.id > 1)
result = filtered_df.collect()
Attempts: 2 left
💡 Hint
Remember filter keeps rows where the condition is true.
✗ Incorrect
The filter condition df.id > 1 keeps rows with id 2 and 3 only.
❓ Data Output
intermediate · 2:00 remaining
Result of a Spark DataFrame groupBy and count
What is the output of this Spark code after grouping and counting?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('test').getOrCreate()
data = [('red', 1), ('blue', 2), ('red', 3), ('blue', 4), ('green', 5)]
df = spark.createDataFrame(data, ['color', 'value'])
grouped_df = df.groupBy('color').count().orderBy('color')
result = grouped_df.collect()
Attempts: 2 left
💡 Hint
Count how many times each color appears.
✗ Incorrect
Colors 'red' and 'blue' appear twice each, 'green' once. Ordering is alphabetical by color.
🔧 Debug
advanced · 2:00 remaining
Identify the error in Spark DataFrame join code
What error will this Spark code raise when executed?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('test').getOrCreate()
data1 = [(1, 'a'), (2, 'b')]
data2 = [(1, 'x'), (3, 'y')]
df1 = spark.createDataFrame(data1, ['id', 'val1'])
df2 = spark.createDataFrame(data2, ['id', 'val2'])
joined_df = df1.join(df2, on='ID')
result = joined_df.collect()
Attempts: 2 left
💡 Hint
Check column name case sensitivity in join keys.
✗ Incorrect
The join key 'ID' does not match the actual column name 'id'. When spark.sql.caseSensitive is enabled, this mismatch fails analysis with an AnalysisException; note, however, that Spark resolves column names case-insensitively by default, so with default settings the join may succeed despite the casing error.
❓ Visualization
advanced · 2:00 remaining
Visualizing missing data counts in a Spark DataFrame
Which code snippet correctly computes the count of missing (null) values per column in a Spark DataFrame?
Attempts: 2 left
💡 Hint
Use when() to filter nulls and count() to count them.
✗ Incorrect
Option D uses when() to build a conditional expression that is non-null only for missing values, then applies count() to each such expression, yielding the null count per column.
🚀 Application
expert · 3:00 remaining
Unit test for a Spark transformation function
Given a Spark transformation function that adds a new column 'double_value' doubling an existing 'value' column, which unit test code correctly verifies the transformation?
Apache Spark
def add_double_value_column(df):
    from pyspark.sql.functions import col
    return df.withColumn('double_value', col('value') * 2)

# Unit test code options below
Attempts: 2 left
💡 Hint
Check that the new column values are double the original values.
✗ Incorrect
Option B collects the new column 'double_value' from the transformed DataFrame and compares it to the expected doubled values.