In Apache Spark, optimization helps reduce resource usage. Which of the following best explains how this happens?
Think about how fewer steps can save resources.
Optimization merges consecutive operations (such as a filter followed by a map) into a single pass over the data, saving memory and CPU.
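A pure-Python sketch (no Spark required, the names are illustrative) of what this merging means: instead of materializing an intermediate collection between the filter and the map, the fused version tests and transforms each element in one pass.

```python
def two_pass(data, pred, fn):
    # Unfused: the intermediate list 'kept' is fully materialized
    # before the map runs, costing an extra pass and extra memory.
    kept = [x for x in data if pred(x)]
    return [fn(x) for x in kept]

def fused_filter_map(data, pred, fn):
    # Fused: each element is filtered and mapped immediately,
    # so no intermediate collection exists.
    return [fn(x) for x in data if pred(x)]

print(fused_filter_map(range(10), lambda x: x % 2 == 0, lambda x: x * 2))
# -> [0, 4, 8, 12, 16]
```

Both functions return the same result; the fused one simply avoids the intermediate pass, which is the saving Spark's optimizer achieves.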
Consider the following Spark code that filters and maps data. What is the output count after optimization?
data = spark.sparkContext.parallelize(range(10))
filtered = data.filter(lambda x: x % 2 == 0)
mapped = filtered.map(lambda x: x * 2)
result = mapped.collect()
print(len(result))
Count how many even numbers are between 0 and 9.
The filter keeps the even numbers 0, 2, 4, 6, 8 (five elements). Mapping doubles their values but does not change the count, so the output is 5.
This Spark job fails with an out-of-memory error. What is the main cause?
rdd = sc.parallelize(range(1000000))
result = rdd.map(lambda x: x * 2).collect()
Think about what happens when collecting large data to the driver.
collect() pulls every partition of the RDD into the driver's memory at once; for a large RDD this exceeds the driver's available memory and causes the out-of-memory error. Prefer aggregations like count() or a bounded sample like take(n), which keep the bulk of the data on the executors.
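A pure-Python analogy (not the Spark API itself) of why count() and take(n) are safe where collect() is not: a generator yields one element at a time, so aggregating or sampling never holds the full dataset in memory.

```python
from itertools import islice

def doubled(n):
    # Generator: produces one doubled value at a time, never
    # holding the whole sequence in memory (analogous to data
    # staying distributed instead of being collect()-ed).
    for x in range(n):
        yield x * 2

count = sum(1 for _ in doubled(1_000_000))      # like rdd.count()
first5 = list(islice(doubled(1_000_000), 5))     # like rdd.take(5)
print(count, first5)
# -> 1000000 [0, 2, 4, 6, 8]
```

Only the five sampled values and a running counter ever exist in local memory, mirroring how count() and take(n) return small results to the driver.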
Given this Spark code, what is the output of the second action?
rdd = sc.parallelize([1, 2, 3, 4])
rdd_cached = rdd.map(lambda x: x * 2).cache()
count1 = rdd_cached.count()
count2 = rdd_cached.count()
print(count2)
Count returns the number of elements, not their sum.
cache() stores the RDD once the first action materializes it, so the second count() reads from the cache instead of recomputing the map. Both counts return the number of elements, which is 4.
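A toy sketch of this caching behavior (the class and its names are illustrative, not Spark's implementation): the transformation runs only on the first action, and later actions reuse the stored result.

```python
class CachedDataset:
    """Toy model of cache(): compute on first action, reuse after."""

    def __init__(self, source, fn):
        self.source = source
        self.fn = fn
        self._cache = None
        self.compute_runs = 0  # how many times the map actually ran

    def _materialize(self):
        # Lazily compute once; subsequent calls hit the cache.
        if self._cache is None:
            self.compute_runs += 1
            self._cache = [self.fn(x) for x in self.source]
        return self._cache

    def count(self):
        return len(self._materialize())

ds = CachedDataset([1, 2, 3, 4], lambda x: x * 2)
print(ds.count(), ds.count(), ds.compute_runs)
# -> 4 4 1
```

Both counts return 4, and the doubling ran exactly once, which is the point of cache() in the quiz item above.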
You have a Spark job that fails during shuffle due to large data movement. Which optimization technique best prevents this failure?
Think about how to reduce data movement during joins.
Broadcast joins ship the small dataset whole to every executor, so the large dataset never has to be shuffled across the network, preventing failures caused by large data movement.
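A simplified pure-Python sketch of the idea (partitions modeled as lists; in real Spark you would hint this with pyspark.sql.functions.broadcast): the small table is copied to each partition as a lookup dict, so rows of the large dataset are joined in place and never move between partitions.

```python
def broadcast_join(large_partitions, small_table):
    # small_table is "broadcast": every partition gets its own copy,
    # so large rows are joined locally with no shuffle.
    joined = []
    for partition in large_partitions:
        lookup = dict(small_table)  # per-partition copy of the small side
        for key, value in partition:
            if key in lookup:
                joined.append((key, value, lookup[key]))
    return joined

parts = [[(1, "a"), (2, "b")], [(2, "c"), (3, "d")]]
dims = {1: "x", 2: "y"}
print(broadcast_join(parts, dims))
# -> [(1, 'a', 'x'), (2, 'b', 'y'), (2, 'c', 'y')]
```

Copying the small side is cheap precisely because it is small; the expensive alternative, repartitioning the large side by join key, is the shuffle this technique avoids.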