Which option correctly describes the main difference between Hadoop MapReduce and Apache Spark in terms of data processing?
Think about how each system handles intermediate data during processing.
Hadoop MapReduce writes intermediate data to disk between the map and reduce phases, which slows down processing. Spark keeps intermediate data in memory, making it faster, especially for iterative tasks.
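To make the contrast concrete, here is a toy word count in plain Python (not actual Hadoop or Spark code): the MapReduce-style pipeline spills its intermediate (word, 1) pairs through a temporary file on disk, while the Spark-style pipeline keeps them in memory end to end. Both produce the same result; the difference is where the intermediate data lives.

```python
import json
import tempfile
from collections import Counter

def wordcount_via_disk(lines):
    # Map phase: write intermediate (word, 1) pairs to disk, as MapReduce does.
    with tempfile.TemporaryFile("w+") as f:
        for line in lines:
            for word in line.split():
                f.write(json.dumps([word, 1]) + "\n")
        f.seek(0)
        # Reduce phase: read the pairs back from disk and aggregate.
        counts = Counter()
        for row in f:
            word, one = json.loads(row)
            counts[word] += one
    return dict(counts)

def wordcount_in_memory(lines):
    # Spark-style: intermediate data stays in memory the whole time.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)

lines = ["spark is fast", "hadoop is reliable", "spark keeps data in memory"]
assert wordcount_via_disk(lines) == wordcount_in_memory(lines)
```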
Which statement best explains how fault tolerance is handled differently in Hadoop and Spark?
Consider how each system recovers lost data after a failure.
Hadoop relies on replicating data blocks on HDFS to recover from failures. Spark tracks transformations (lineage) and can recompute lost data without replicating it.
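A minimal sketch of the lineage idea in plain Python (the `ToyRDD` class is an illustration, not the real RDD API): each dataset remembers its parent and the transformation that produced it, so a "lost" partition can be recomputed by replaying the lineage instead of being restored from a replica.

```python
class ToyRDD:
    def __init__(self, partitions, parent=None, transform=None):
        self.partitions = partitions  # list of lists (the data)
        self.parent = parent          # lineage: where this data came from
        self.transform = transform    # lineage: how it was derived

    def map(self, fn):
        new_parts = [[fn(x) for x in part] for part in self.partitions]
        return ToyRDD(new_parts, parent=self,
                      transform=lambda part: [fn(x) for x in part])

    def recompute_partition(self, i):
        # Replay the recorded transformation on the parent's partition.
        return self.transform(self.parent.partitions[i])

base = ToyRDD([[1, 2], [3, 4]])
squared = base.map(lambda x: x * x)

# Simulate losing partition 1, then recover it from lineage alone.
squared.partitions[1] = None
squared.partitions[1] = squared.recompute_partition(1)
assert squared.partitions == [[1, 4], [9, 16]]
```

Note that nothing was ever copied: recovery cost is recomputation, not storage, which is the trade-off Spark makes relative to HDFS block replication.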
Given the following scenario: A machine learning algorithm runs 10 iterations on a large dataset. Which framework will likely complete the task faster and why?
Think about how iterative algorithms benefit from in-memory data storage.
Spark caches data in memory, so iterative algorithms run faster by avoiding repeated disk reads and writes. Hadoop MapReduce writes intermediate data to disk on each iteration, slowing the process down.
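This toy illustration in plain Python counts how often the base dataset is "loaded" across 10 iterations, with and without a cache. In real Spark, `rdd.cache()` or `df.persist()` plays the role of the `cached` variable below; the reload-per-iteration branch mimics MapReduce's behavior.

```python
load_count = 0

def load_dataset():
    # Stand-in for an expensive read from disk or HDFS.
    global load_count
    load_count += 1
    return list(range(5))

def run_iterations(n, use_cache):
    global load_count
    load_count = 0
    cached = load_dataset() if use_cache else None
    model = 0.0
    for _ in range(n):
        # Without a cache, every iteration pays the load cost again.
        data = cached if use_cache else load_dataset()
        model += sum(data) / len(data)
    return load_count

assert run_iterations(10, use_cache=False) == 10  # one load per iteration
assert run_iterations(10, use_cache=True) == 1    # loaded once, reused in memory
```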
Which option correctly describes how Hadoop and Spark manage cluster resources?
Consider the flexibility of Spark in cluster environments.
Hadoop uses YARN as its default resource manager. Spark is more flexible and can run on YARN, Mesos, or its own standalone cluster manager.
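That flexibility is visible in how a Spark job is launched: the same application can be submitted to different cluster managers just by changing the `--master` flag of `spark-submit` (a config fragment; `host`, ports, and `app.py` are placeholders):

```shell
# Run on Hadoop's YARN resource manager
spark-submit --master yarn app.py

# Run on Spark's own standalone cluster manager
spark-submit --master spark://host:7077 app.py

# Run on Mesos
spark-submit --master mesos://host:5050 app.py

# Run locally, using all available cores (handy for development)
spark-submit --master "local[*]" app.py
```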
You need to build a system that processes streaming data from sensors in near real-time and performs complex analytics. Which framework is best suited and why?
Think about which framework supports streaming and fast analytics.
Spark supports real-time streaming with Spark Streaming and Structured Streaming APIs, enabling low-latency analytics. Hadoop MapReduce is batch-oriented and not designed for real-time streaming.
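The model behind Spark Streaming can be sketched as a micro-batch loop: readings arrive continuously, and the engine groups them into small batches and updates an aggregate with low latency. The plain-Python generator below is a toy mimicking that model (in real Spark you would use `spark.readStream` with Structured Streaming instead):

```python
from statistics import mean

def micro_batches(stream, batch_size):
    # Group an incoming stream of readings into small batches,
    # the way a micro-batch streaming engine slices its input.
    batch = []
    for reading in stream:
        batch.append(reading)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush any trailing partial batch
        yield batch

sensor_stream = [21.0, 21.5, 22.0, 22.5, 23.0, 23.5]
batch_means = [mean(b) for b in micro_batches(sensor_stream, batch_size=2)]
assert batch_means == [21.25, 22.25, 23.25]
```

Each batch is processed as soon as it fills, so results are available within one batch interval of the data arriving, rather than after an entire batch job completes.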