Apache Spark · ~20 mins

Select, filter, and where operations in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output (intermediate)
Output of filter with multiple conditions
What is the output of this Apache Spark code snippet filtering a DataFrame?
Apache Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [(1, "apple", 10), (2, "banana", 5), (3, "carrot", 7), (4, "date", 10)]
df = spark.createDataFrame(data, ["id", "fruit", "quantity"])
filtered_df = df.filter((df.quantity == 10) & (df.fruit != "date"))
result = filtered_df.select("id", "fruit").collect()
print(result)
A. []
B. [Row(id=1, fruit='apple')]
C. [Row(id=1, fruit='apple'), Row(id=4, fruit='date')]
D. [Row(id=4, fruit='date')]
💡 Hint
Remember that filter uses AND logic with & and excludes rows where fruit is 'date'.
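The parentheses around each condition in the snippet are not optional: in Python, `&` binds more tightly than comparison operators like `==` and `!=`, which is why PySpark requires them around every comparison. A plain-Python sketch of the precedence pitfall (no Spark session needed):

```python
# Python's & (bitwise AND) binds tighter than == or !=,
# so without parentheses the expression groups unexpectedly.
unparenthesized = 1 == 1 & 0          # parsed as 1 == (1 & 0), i.e. 1 == 0
parenthesized = (1 == 1) & (0 == 0)   # both comparisons evaluate first
print(unparenthesized, parenthesized)
```

With Spark Columns, leaving the parentheses off typically raises an error outright (Python tries to chain the comparisons and cannot convert a Column to a bool) rather than silently misgrouping, but the precedence rule behind it is the same.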
Data Output (intermediate)
Number of rows after where condition
How many rows remain after applying this where condition on the DataFrame?
Apache Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [(10, "red"), (15, "blue"), (20, "red"), (25, "green"), (30, "blue")]
df = spark.createDataFrame(data, ["value", "color"])
filtered_df = df.where("value > 15 AND color = 'blue'")
count = filtered_df.count()
print(count)
A. 3
B. 2
C. 1
D. 0
💡 Hint
Check which rows have value greater than 15 and color exactly 'blue'.
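Note that `where` is an alias for `filter` and also accepts a SQL expression string, as here. To check a predicted count, you can mirror the same predicate over the raw Python list (a plain-Python check, no Spark session needed):

```python
# Same rows as the DataFrame in the question.
data = [(10, "red"), (15, "blue"), (20, "red"), (25, "green"), (30, "blue")]

# Mirrors the SQL predicate "value > 15 AND color = 'blue'".
matching = [(value, color) for value, color in data
            if value > 15 and color == "blue"]
print(len(matching))
```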
🔧 Debug (advanced)
Identify the error in this filter expression
What error does this Apache Spark code raise when run?
Apache Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [(1, 100), (2, 200)]
df = spark.createDataFrame(data, ["id", "score"])
filtered_df = df.filter("score => 150")
filtered_df.show()
A. SyntaxError: invalid syntax
B. TypeError: '>' not supported between instances
C. AnalysisException: cannot resolve 'score => 150' given input columns
D. No error, outputs filtered rows
💡 Hint
Check the operator used for comparison in the filter condition.
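For reference, `=>` is not a comparison operator at all; the "greater than or equal" spelling is `>=`. A quick plain-Python check using `compile` makes the point (Spark's SQL expression parser likewise rejects `=>` in a comparison):

```python
# ">=" parses fine as Python source; "=>" does not.
compile("score >= 150", "<check>", "eval")  # succeeds

try:
    compile("score => 150", "<check>", "eval")
except SyntaxError as exc:
    print("invalid operator:", exc.msg)
```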
🚀 Application (advanced)
Select and filter to find fruits with quantity less than average
Given a DataFrame of fruits and quantities, which code snippet correctly selects fruit names with quantity less than the average quantity?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg
spark = SparkSession.builder.getOrCreate()
data = [("apple", 10), ("banana", 5), ("carrot", 7), ("date", 10)]
df = spark.createDataFrame(data, ["fruit", "quantity"])
avg_qty = df.select(avg("quantity")).collect()[0][0]
A. df.filter(quantity < avg_qty).select("fruit").show()
B. df.where("quantity < avg_qty").select("fruit").show()
C. df.filter(df.quantity < df.avg_qty).select("fruit").show()
D. df.filter(df["quantity"] < avg_qty).select("fruit").show()
💡 Hint
Remember avg_qty is a Python variable, not a DataFrame column.
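The trap in the SQL-string form is that the string is opaque to Python: Spark sees the literal text `avg_qty` and looks for a column of that name, not the variable's value. If you do want the string form, interpolate the value yourself, for example with an f-string. A plain-Python illustration (the 8.0 is a hypothetical stand-in for the collected average):

```python
avg_qty = 8.0  # hypothetical value, standing in for the collected average

plain = "quantity < avg_qty"            # Spark would look for a column named avg_qty
interpolated = f"quantity < {avg_qty}"  # embeds the Python value into the SQL text
print(plain)
print(interpolated)
```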
🧠 Conceptual (expert)
Understanding lazy evaluation in filter and select
Which statement best describes how Apache Spark handles filter and select operations on a DataFrame?
A. Spark executes select operations immediately but delays filter operations.
B. Spark immediately executes filter and select operations and returns results.
C. Spark executes filter operations immediately but delays select operations.
D. Spark builds a query plan and delays execution until an action is called.
💡 Hint
Think about when Spark actually runs computations on data.
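Transformations such as filter and select only extend Spark's query plan; nothing runs until an action like collect, count, or show forces execution. Python generators behave analogously, which gives a Spark-free sketch of the idea (an analogy, not Spark itself):

```python
log = []

def trace(x):
    # Records when each element is actually processed.
    log.append(x)
    return x

data = [1, 2, 3, 4]

# Building the pipeline is "lazy": no element is processed yet.
pipeline = (trace(x) * 10 for x in data if x % 2 == 0)
assert log == []  # nothing has run

# Consuming the generator is the "action" that triggers execution.
result = list(pipeline)
print(result, log)
```

Just as `list(...)` here drives the generator, an action such as `collect()` is what makes Spark actually optimize and run the accumulated plan.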