Which statement best describes the main difference between Apache Spark and Hadoop MapReduce in how they process data?
Think about how each framework handles intermediate data during processing.
Spark keeps data in memory between steps, making it faster. Hadoop MapReduce writes data to disk after each step, which slows it down.
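The contrast can be sketched in plain Python (this is a toy simulation of the two execution models, not the actual Spark or Hadoop API): the MapReduce-style function round-trips its intermediate result through a temporary file, standing in for HDFS writes between stages, while the Spark-style function chains stages in memory.

```python
import json
import os
import tempfile

# Spark-style: stage outputs stay in memory and flow straight into the
# next transformation; no intermediate data touches the disk.
def spark_style(nums):
    doubled = [x * 2 for x in nums]   # stage 1 output held in memory
    return [x + 1 for x in doubled]   # stage 2 consumes it directly

# MapReduce-style: each stage writes its output to disk (HDFS in reality)
# and the next stage reads it back, adding I/O at every stage boundary.
def mapreduce_style(nums):
    path = os.path.join(tempfile.mkdtemp(), "stage1.json")
    with open(path, "w") as f:
        json.dump([x * 2 for x in nums], f)   # stage 1: write to "disk"
    with open(path) as f:
        doubled = json.load(f)                # stage 2: read it back
    return [x + 1 for x in doubled]

# Both produce the same result; only the handling of intermediate
# data differs, which is exactly where the speed gap comes from.
print(spark_style([1, 2, 3, 4]))      # [3, 5, 7, 9]
print(mapreduce_style([1, 2, 3, 4]))  # [3, 5, 7, 9]
```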
What will be the output of this Spark code snippet?
rdd = sc.parallelize([1, 2, 3, 4])
result = rdd.map(lambda x: x * 2).filter(lambda x: x > 4).collect()
print(result)
First multiply each number by 2, then keep only those greater than 4.
Mapping doubles each number: [2, 4, 6, 8]. Filtering keeps only values greater than 4, so the output is [6, 8].
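The same pipeline can be checked with plain Python built-ins, no Spark cluster needed, since `map` and `filter` behave the same way element by element:

```python
# Reproduce the Spark pipeline with Python's built-in map and filter.
data = [1, 2, 3, 4]
mapped = list(map(lambda x: x * 2, data))       # [2, 4, 6, 8]
result = list(filter(lambda x: x > 4, mapped))  # [6, 8]
print(result)  # [6, 8]
```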
Given a large dataset, which framework will typically use less disk I/O during iterative machine learning tasks?
Consider how iterative tasks benefit from caching.
Spark caches data in memory, reducing disk reads and writes during iterations. Hadoop MapReduce writes to disk each time, increasing disk I/O.
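A toy simulation (plain Python, not the real frameworks) makes the I/O difference concrete: over ten iterations, the MapReduce-style loop re-reads its input every time, while the Spark-style loop reads once and iterates over the in-memory copy.

```python
disk_reads = 0

def load_from_disk():
    """Stand-in for reading a dataset from HDFS; counts each read."""
    global disk_reads
    disk_reads += 1
    return [1.0, 2.0, 3.0]

# MapReduce-style: every iteration re-reads the input from disk.
disk_reads = 0
for _ in range(10):
    batch = load_from_disk()
mapreduce_reads = disk_reads   # 10 reads

# Spark-style: read once, keep the data cached in memory, then iterate.
disk_reads = 0
cached = load_from_disk()      # single read; lives in memory afterwards
for _ in range(10):
    batch = cached
spark_reads = disk_reads       # 1 read

print(mapreduce_reads, spark_reads)  # 10 1
```

The gap widens with every extra iteration, which is why iterative algorithms like gradient descent or k-means benefit so much from caching.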
A Spark job is running slower than expected compared to a similar Hadoop MapReduce job. Which of the following is the most likely cause?
Think about how Spark handles repeated computations without caching.
If an RDD is not cached, Spark re-executes its entire lineage of transformations every time an action uses it, and that repeated recomputation can make the job slower than MapReduce. Hadoop MapReduce pays for disk writes once per stage but never recomputes finished work.
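A toy lineage simulation (plain Python, not the Spark API) shows the cost of a missing cache() call: the uncached path re-runs the expensive transformation once per action, while the cached path materializes it once and reuses the result.

```python
compute_count = 0

def expensive_transform(data):
    """Stand-in for an uncached RDD lineage; counts full recomputations."""
    global compute_count
    compute_count += 1
    return [x * 2 for x in data]

source = [1, 2, 3]

# Without caching: three actions trigger three full recomputations.
compute_count = 0
for _ in range(3):
    result = expensive_transform(source)
uncached_runs = compute_count   # 3

# With caching: compute once, then reuse the materialized result.
compute_count = 0
cached = expensive_transform(source)
for _ in range(3):
    result = cached
cached_runs = compute_count     # 1

print(uncached_runs, cached_runs)  # 3 1
```

In real PySpark the fix is a single call such as `rdd.cache()` (or `rdd.persist()`) before the RDD is reused.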
You need to process streaming data with low latency for real-time analytics. Which framework is best suited for this task?
Consider which framework supports streaming and low latency.
Spark Streaming processes data in near real-time using in-memory computations. Hadoop MapReduce is designed for batch jobs and is not suitable for real-time streaming.
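The micro-batch model behind Spark Streaming can be sketched in plain Python (a simulation, not the Spark Streaming API): the unbounded stream is chopped into small batches, and each batch updates state in memory as soon as it arrives, producing a low-latency result per batch.

```python
# Each inner list is one micro-batch of events arriving over time.
stream = [[3, 1], [4, 1], [5, 9]]

running_total = 0
outputs = []
for batch in stream:
    running_total += sum(batch)   # in-memory state update per micro-batch
    outputs.append(running_total) # result emitted right after each batch

print(outputs)  # [4, 9, 23]
```

A batch-oriented MapReduce job, by contrast, would have to wait for the whole dataset to land on disk before producing any output at all.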