Which statement best describes lazy evaluation in Apache Spark?
Think about when Spark actually runs the computations.
Lazy evaluation means Spark waits to run transformations until an action triggers execution.
What will be the output of the following Spark code snippet?
rdd = sc.parallelize([1, 2, 3, 4])
mapped_rdd = rdd.map(lambda x: x * 2)
print(mapped_rdd.collect())
Consider what triggers execution in Spark.
The collect() action triggers execution, so the doubled values [2, 4, 6, 8] are returned.
Given the following Spark code, how many Spark jobs will be triggered?
rdd = sc.parallelize([1, 2, 3, 4])
mapped = rdd.map(lambda x: x + 1)
filtered = mapped.filter(lambda x: x % 2 == 0)
count = filtered.count()
collected = filtered.collect()
Each action triggers a job. How many actions are there?
There are two actions: count() and collect(), so two jobs run.
Why does the following Spark code produce no output?
rdd = sc.parallelize([10, 20, 30])
rdd.map(lambda x: x * 3)
Think about what triggers Spark to run transformations.
Transformations like map() are lazy: Spark only records the lineage, and nothing executes until an action such as collect() is called. This snippet never calls an action, so no output is produced.
You have a Spark job with multiple transformations and two actions on the same RDD. How can you optimize to avoid running the same transformations twice?
Think about how Spark can reuse data between actions.
Caching or persisting stores the RDD in memory or on disk, so its transformations are computed only once and the result is reused by both actions instead of being recomputed.