Which of the following reasons best explains why Apache Spark is generally faster than MapReduce for big data processing?
Think about how data is stored and accessed during processing.
Spark keeps data in memory between operations, avoiding the disk reads and writes that MapReduce performs between every stage. This is the main reason Spark is much faster for multi-stage jobs.
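The contrast can be sketched in plain Python (no Spark required). The staged pipeline below simulates MapReduce's behavior by serializing each stage's output to a temporary file and reading it back, while the chained version keeps everything in memory; the stage functions and file handling are illustrative assumptions, not Spark or Hadoop internals.

```python
import os
import pickle
import tempfile

data = list(range(1, 6))

# MapReduce-style: each stage writes its output to disk and the next
# stage reads it back (simulated here with pickle files).
def staged_pipeline(records):
    for stage in (lambda x: x * 2, lambda x: x + 1):
        out = [stage(r) for r in records]
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as f:
            pickle.dump(out, f)          # disk write after the stage
        with open(path, "rb") as f:
            records = pickle.load(f)     # disk read before the next stage
        os.remove(path)
    return records

# Spark-style: stages are chained in memory; nothing touches disk.
def chained_pipeline(records):
    return [(x * 2) + 1 for x in records]

print(staged_pipeline(data))   # [3, 5, 7, 9, 11]
print(chained_pipeline(data))  # same result, no disk round-trips
```

Both pipelines compute the same answer; the difference is purely where the intermediate results live.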
Why is Apache Spark better suited for iterative algorithms like machine learning compared to MapReduce?
Consider how data is reused during multiple passes over the same dataset.
Spark can keep a dataset in memory across iterations, avoiding repeated disk reads. MapReduce must write results to disk and read them back on every iteration, which slows it down considerably.
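A small simulation makes the difference concrete by counting simulated "disk reads" with and without caching; `load_dataset` is a hypothetical stand-in for reading input from disk, and caching it once mirrors what `rdd.cache()` does in Spark.

```python
# Count how many times the (simulated) input is read from disk.
disk_reads = 0

def load_dataset():
    """Hypothetical stand-in for reading the input dataset from disk."""
    global disk_reads
    disk_reads += 1
    return list(range(10))

# MapReduce-style: every iteration rereads the input from disk.
for _ in range(3):
    total = sum(load_dataset())
print("uncached disk reads:", disk_reads)  # 3

# Spark-style: load once, keep in memory, reuse across iterations.
disk_reads = 0
cached = load_dataset()  # one read, then held in memory (like rdd.cache())
for _ in range(3):
    total = sum(cached)
print("cached disk reads:", disk_reads)  # 1
```

Three passes cost three reads without caching but only one with it; with real datasets and many iterations, that gap dominates the runtime.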
Given the following Spark code snippet, what is the output collected in the driver?
rdd = sc.parallelize([1, 2, 3, 4, 5])
result = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x).collect()
print(result)
Filter keeps even numbers, then map squares them.
The filter keeps only the even numbers, [2, 4]. The map then squares them, so collect() returns [4, 16], which is what the driver prints.
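The same pipeline can be checked with plain Python list comprehensions, which behave identically to filter and map here and need no Spark context:

```python
# Plain-Python equivalent of the RDD pipeline, to verify the driver output.
data = [1, 2, 3, 4, 5]
evens = [x for x in data if x % 2 == 0]   # filter: [2, 4]
result = [x * x for x in evens]           # map:    [4, 16]
print(result)  # [4, 16]
```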
What feature of Apache Spark helps it recover from failures efficiently without writing intermediate data to disk like MapReduce?
Think about how Spark knows how to rebuild lost data partitions.
Spark records each RDD's lineage, the sequence of transformations used to build it, and uses that record to recompute lost partitions after a failure instead of writing all intermediate data to disk.
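The idea can be sketched with a toy class (this is an illustration of the concept, not Spark's actual implementation): each "RDD" remembers its parent and the transformation that produced it, so a lost partition can be rebuilt by replaying the lineage from the source.

```python
# Toy sketch of lineage-based recovery (hypothetical, not Spark internals).
class ToyRDD:
    def __init__(self, parent=None, transform=None, source=None):
        self.parent = parent        # upstream ToyRDD, if any
        self.transform = transform  # function applied to the parent's data
        self.source = source        # base data for the root RDD
        self.partition = None       # materialized data; may be lost

    def compute(self):
        # Replay the lineage from the source; no intermediate disk writes.
        if self.parent is None:
            self.partition = list(self.source)
        else:
            self.partition = self.transform(self.parent.compute())
        return self.partition

base = ToyRDD(source=[1, 2, 3, 4, 5])
squared = ToyRDD(parent=base, transform=lambda d: [x * x for x in d])

squared.compute()
squared.partition = None   # simulate losing the partition on node failure
print(squared.compute())   # rebuilt from lineage: [1, 4, 9, 16, 25]
```

Because the recipe (not the data) is what gets stored, recovery only costs recomputation of the lost partitions.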
You have a big dataset and need to run complex machine learning algorithms that require multiple passes over the data. Which reason best justifies choosing Apache Spark over MapReduce?
Consider the nature of iterative algorithms and data access speed.
Iterative algorithms reuse the same dataset many times. Spark's in-memory caching avoids the repeated disk I/O that MapReduce incurs on every pass, making it much faster for these workloads.
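The access pattern in question looks like the following sketch: a simple gradient-descent loop (fitting the mean of a toy dataset by minimizing squared error) that scans the same in-memory data on every iteration. The dataset, learning rate, and iteration count are illustrative choices.

```python
# Iterative access pattern that favors in-memory caching:
# the same dataset is scanned on every pass.
data = [2.0, 4.0, 6.0, 8.0]  # held in memory once, reused each iteration

mu, lr = 0.0, 0.1
for _ in range(200):
    # gradient of sum((mu - x)^2) with respect to mu
    grad = sum(2 * (mu - x) for x in data)
    mu -= lr * grad
print(round(mu, 3))  # converges to the mean of the data, 5.0
```

In MapReduce, each of those 200 passes would be a separate job rereading the input from disk; in Spark, a cached RDD makes every pass after the first a memory scan.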