Which of the following best describes a Resilient Distributed Dataset (RDD) in Apache Spark?
Think about how Spark handles data across multiple machines and recovers from failures.
An RDD is a fundamental Spark data structure that is distributed across cluster nodes and can recover from node failures automatically, making it fault-tolerant. Spark achieves this by recording each RDD's lineage (the sequence of transformations used to build it), so lost partitions can be recomputed from their source rather than restored from replicas.
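A minimal pure-Python sketch of the lineage idea behind RDD fault tolerance (this is an illustration, not actual Spark code): each dataset records its parent and the transformation that produced it, so any result can be recomputed from the source after a failure.

```python
# Pure-Python sketch of RDD-style lineage (NOT actual Spark internals).
# A dataset remembers its parent and the transformation applied to it,
# so results can always be recomputed from the lineage chain.
class LineageDataset:
    def __init__(self, data=None, parent=None, fn=None):
        self.data = data      # source data (None for derived datasets)
        self.parent = parent  # parent dataset in the lineage chain
        self.fn = fn          # transformation applied to the parent

    def map(self, fn):
        # A transformation does no work yet; it just extends the lineage.
        return LineageDataset(parent=self, fn=fn)

    def compute(self):
        # Recompute from the lineage chain rather than cached results.
        if self.parent is None:
            return list(self.data)
        return [self.fn(x) for x in self.parent.compute()]

base = LineageDataset(data=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
# Even if intermediate results were lost, compute() rebuilds them.
print(doubled.compute())  # [2, 4, 6]
```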
What does it mean that RDDs are immutable in Apache Spark?
Consider how Spark manages data consistency and fault tolerance.
RDDs are immutable, meaning you cannot change their data after creation. Instead, transformations produce new RDDs, which helps with fault tolerance and consistency.
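The same pattern can be seen with an immutable Python tuple, as a rough analogue (a sketch, not Spark code): a "transformation" builds a new collection and the original is never modified.

```python
# Pure-Python analogue of RDD immutability (NOT actual Spark code):
# a transformation returns a new collection; the source is untouched.
original = (1, 2, 3)  # tuple: immutable, like an RDD's data
transformed = tuple(x * 10 for x in original)  # "transformation" -> new dataset

print(original)     # (1, 2, 3) -- unchanged after the transformation
print(transformed)  # (10, 20, 30)
```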
What is the output of the following Spark code snippet?
rdd = sc.parallelize([1, 2, 3, 4, 5])
result = rdd.filter(lambda x: x % 2 == 0).collect()
print(result)
Filter keeps elements where the condition is true.
The filter keeps only even numbers (divisible by 2), so the output is [2, 4].
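The same predicate can be checked with a plain-Python equivalent of `filter(...).collect()` (an analogue for verification, not PySpark itself):

```python
# Plain-Python equivalent of rdd.filter(lambda x: x % 2 == 0).collect().
data = [1, 2, 3, 4, 5]
result = [x for x in data if x % 2 == 0]  # keep elements where the condition is True
print(result)  # [2, 4]
```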
Given the following RDD and action, what is the output?
rdd = sc.parallelize([10, 20, 30, 40])
sum_value = rdd.reduce(lambda a, b: a + b)
print(sum_value)
Reduce combines all elements using the given function.
The reduce sums all elements: 10 + 20 + 30 + 40 = 100.
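The arithmetic can be confirmed with Python's own `functools.reduce`, which applies the same pairwise combining function (a local analogue, not a Spark cluster run):

```python
from functools import reduce

# Plain-Python equivalent of rdd.reduce(lambda a, b: a + b).
data = [10, 20, 30, 40]
sum_value = reduce(lambda a, b: a + b, data)  # ((10 + 20) + 30) + 40
print(sum_value)  # 100
```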
What error will the following Spark code produce?
rdd = sc.parallelize([1, 2, 3])
result = rdd.map(lambda x: x / 0).collect()
print(result)
Consider what happens when dividing by zero in Python.
Dividing by zero raises a ZeroDivisionError in Python. Because transformations are lazy, the error surfaces only when the collect() action runs the map tasks, at which point the failing tasks cause the Spark job to abort with this error.
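The failure mode can be reproduced locally with plain Python, since the lambda from the question raises the exception as soon as it is applied (a local demonstration, not a Spark run):

```python
# Plain-Python demonstration: applying x / 0 raises ZeroDivisionError
# on the very first element.
data = [1, 2, 3]
error_name = None
try:
    result = [x / 0 for x in data]
except ZeroDivisionError as e:
    error_name = type(e).__name__
print(error_name)  # ZeroDivisionError
```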