Recall & Review
beginner
What is an accumulator variable in Apache Spark?
An accumulator variable is a shared variable used to count or sum values across the tasks of a distributed Spark job. Tasks add to it in parallel and the driver reads the aggregated result, so information like counts or sums can be tracked safely across many workers.
beginner
How do accumulator variables behave in Spark tasks?
Inside tasks, accumulator variables are write-only: tasks can only add to them. The aggregated value only becomes reliable on the driver once the job finishes, which prevents conflicting updates from tasks running in parallel.
intermediate
Why should accumulator variables only be used for adding or counting?
Because accumulators merge per-task contributions in whatever order tasks happen to finish, the operation must be commutative and associative, like addition. Order-sensitive operations such as subtraction give different results for different task orderings, and task retries can replay updates, so anything other than adding can silently produce incorrect results.
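The order-sensitivity argument above can be checked with a small pure-Python sketch (no Spark needed): merging per-task partial results with addition gives the same total for every arrival order, while subtraction does not.

```python
from functools import reduce
from itertools import permutations

# Per-task partial results, which may reach the driver in any order.
partials = [3, 5, 2]

# Addition is commutative and associative: every merge order agrees.
sums = {reduce(lambda a, b: a + b, p) for p in permutations(partials)}
print(sums)   # {10}

# Subtraction is neither, so the merged result depends on task order.
diffs = {reduce(lambda a, b: a - b, p) for p in permutations(partials)}
print(diffs)  # three different answers: {0, -4, -6}
```

This is exactly why accumulators restrict tasks to add-only updates.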
beginner
Show a simple example of creating and using an accumulator in Spark with Python.
In PySpark, you create an accumulator with sc.accumulator(0). Then inside an RDD action such as foreach, you add to it with accum.add(1). After the job completes, you read the result on the driver with accum.value.
intermediate
What happens if you try to read an accumulator's value inside a Spark task?
You cannot read an accumulator's value inside tasks: PySpark raises an error if a task touches accum.value, and in general the up-to-date value only exists on the driver after the job completes. Only the driver should read the value.
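A minimal pure-Python sketch of this rule (a hypothetical toy Accumulator class, not the real pyspark one, though PySpark likewise raises an exception when .value is accessed inside a task):

```python
class Accumulator:
    """Toy model of a Spark accumulator: tasks may only add;
    only the driver may read the value."""

    def __init__(self, init):
        self._value = init
        self._on_driver = True   # cleared when a copy is shipped to a task

    def add(self, term):
        self._value += term      # the one operation tasks are allowed

    @property
    def value(self):
        if not self._on_driver:
            raise RuntimeError("Accumulator.value cannot be read inside a task")
        return self._value


accum = Accumulator(0)
accum.add(4)
print(accum.value)        # 4: reading on the driver is fine

accum._on_driver = False  # pretend this copy now lives inside a worker task
try:
    accum.value
except RuntimeError as err:
    print(err)            # reading inside a task fails
```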
What is the main use of accumulator variables in Spark?
Accumulators are designed to safely aggregate counts or sums from many parallel tasks.
Which operation is safe to perform on accumulators inside Spark tasks?
Only addition (or other commutative and associative operations, like summing and counting) is safe with accumulators; order-sensitive operations break under parallelism and task retries.
When can you reliably read the value of an accumulator in Spark?
Accumulator values are aggregated on the driver; you can rely on the value only after the job has completed, not during task execution.
What happens if a Spark task is retried when using accumulators?
Retried tasks can re-apply their accumulator additions, counting values more than once. Spark guarantees exactly-once accumulator updates only for updates made inside actions; updates made in transformations may be applied again on retry.
Which Spark context method is used to create an accumulator in PySpark?
The sc.accumulator() method creates an accumulator variable.
Explain what accumulator variables are and why they are useful in Spark.
Think about how you count things safely when many people work in parallel.
Describe best practices and limitations when using accumulator variables in Spark.
Consider what can go wrong if you try to do other math or read values too early.