Challenge - 5 Problems
Spark Select and Filter Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Difficulty: intermediate | Time limit: 2:00
Output of filter with multiple conditions
What is the output of this Apache Spark code snippet filtering a DataFrame?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, "apple", 10), (2, "banana", 5), (3, "carrot", 7), (4, "date", 10)]
df = spark.createDataFrame(data, ["id", "fruit", "quantity"])
filtered_df = df.filter((df.quantity == 10) & (df.fruit != "date"))
result = filtered_df.select("id", "fruit").collect()
print(result)
💡 Hint
Remember that filter uses AND logic with & and excludes rows where fruit is 'date'.
The filter keeps rows where quantity is 10 and fruit is not 'date'. Only the row with id=1 and fruit='apple' meets both conditions.
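Spark would print `[Row(id=1, fruit='apple')]`. The same filter-and-project logic can be sanity-checked in plain Python, without a Spark installation (a sketch, not Spark itself):

```python
# Plain-Python equivalent of:
#   df.filter((df.quantity == 10) & (df.fruit != "date")).select("id", "fruit")
data = [(1, "apple", 10), (2, "banana", 5), (3, "carrot", 7), (4, "date", 10)]

# Keep rows where quantity == 10 AND fruit != "date",
# then project the id and fruit columns.
result = [(id_, fruit) for id_, fruit, qty in data
          if qty == 10 and fruit != "date"]
print(result)  # [(1, 'apple')]
```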
❓ Predict Output
Difficulty: intermediate | Time limit: 1:30
Number of rows after where condition
How many rows remain after applying this where condition on the DataFrame?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(10, "red"), (15, "blue"), (20, "red"), (25, "green"), (30, "blue")]
df = spark.createDataFrame(data, ["value", "color"])
filtered_df = df.where("value > 15 AND color = 'blue'")
count = filtered_df.count()
print(count)
💡 Hint
Check which rows have value greater than 15 and color exactly 'blue'.
Only the row with value 30 and color 'blue' meets both conditions, so count is 1.
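The count can be verified with the same predicate in plain Python (a sketch of the SQL-string condition `value > 15 AND color = 'blue'`):

```python
data = [(10, "red"), (15, "blue"), (20, "red"), (25, "green"), (30, "blue")]

# Row (20, "red") passes the value test but fails the color test;
# (15, "blue") fails the strict value > 15 test. Only (30, "blue") passes both.
count = sum(1 for value, color in data if value > 15 and color == "blue")
print(count)  # 1
```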
🔧 Debug
Difficulty: advanced | Time limit: 1:30
Identify the error in this filter expression
What error does this Apache Spark code raise when run?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, 100), (2, 200)]
df = spark.createDataFrame(data, ["id", "score"])
filtered_df = df.filter(df.score => 150)
filtered_df.show()
💡 Hint
Check the operator used for comparison in the filter condition.
The operator '=>' is invalid syntax in Python; the correct operator is '>='.
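With the operator fixed to `>=`, the filter keeps only the row with score 200. The corrected comparison can be checked in plain Python (a sketch of `df.filter(df.score >= 150)`):

```python
data = [(1, 100), (2, 200)]

# ">=" is Python's greater-than-or-equal operator; "=>" is a SyntaxError.
kept = [(id_, score) for id_, score in data if score >= 150]
print(kept)  # [(2, 200)]
```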
🚀 Application
Difficulty: advanced | Time limit: 2:30
Select and filter to find fruits with quantity less than average
Given a DataFrame of fruits and quantities, which code snippet correctly selects fruit names with quantity less than the average quantity?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.getOrCreate()
data = [("apple", 10), ("banana", 5), ("carrot", 7), ("date", 10)]
df = spark.createDataFrame(data, ["fruit", "quantity"])
avg_qty = df.select(avg("quantity")).collect()[0][0]
💡 Hint
Remember avg_qty is a Python variable, not a DataFrame column.
Option A correctly uses the Python variable avg_qty in the filter condition.
Option B uses a string expression, which does not recognize avg_qty.
Option C is valid syntax but less common; it works the same as A.
Option D incorrectly tries to access avg_qty as a DataFrame column.
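The correct approach computes the average first, then compares each quantity against the resulting Python float (in Spark, `df.filter(df.quantity < avg_qty).select("fruit")`). The arithmetic can be verified in plain Python:

```python
data = [("apple", 10), ("banana", 5), ("carrot", 7), ("date", 10)]

# Average quantity: (10 + 5 + 7 + 10) / 4 = 8.0
avg_qty = sum(q for _, q in data) / len(data)

# Keep fruits whose quantity is strictly below the average.
below_avg = [fruit for fruit, q in data if q < avg_qty]
print(below_avg)  # ['banana', 'carrot']
```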
🧠 Conceptual
Difficulty: expert | Time limit: 2:00
Understanding lazy evaluation in filter and select
Which statement best describes how Apache Spark handles filter and select operations on a DataFrame?
💡 Hint
Think about when Spark actually runs computations on data.
Spark uses lazy evaluation, meaning it builds a plan for filter and select but only runs when an action like show() or collect() is called.
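The transformation/action split can be illustrated by analogy with Python generators, which are also lazily evaluated: building the pipeline does no work, and computation happens only when the result is consumed. (This is an analogy, not Spark's actual execution machinery.)

```python
def trace(x, label):
    """Print when an element is actually processed, then pass it through."""
    print(f"{label}: {x}")
    return x

nums = [1, 2, 3, 4]

# "Transformation": building the generator prints nothing yet,
# just as df.filter(...) only extends Spark's logical plan.
filtered = (trace(n, "filter") for n in nums if n % 2 == 0)

# "Action": consuming the generator triggers the work,
# just as collect() or show() triggers Spark's execution.
result = list(filtered)
print(result)  # [2, 4]
```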