
Caching and persistence in Apache Spark - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is caching in Apache Spark?
Caching in Apache Spark means storing a dataset in memory to speed up repeated access. It helps avoid recomputing the dataset every time it is used.
beginner
What does persistence mean in Spark?
Persistence means saving a dataset to memory and/or disk with different storage levels. It allows Spark to reuse the dataset efficiently across multiple operations.
intermediate
Name two common storage levels used in Spark persistence.
Two common storage levels are MEMORY_ONLY (store data only in memory) and MEMORY_AND_DISK (store data in memory, spill to disk if not enough memory).
intermediate
How does caching improve performance in iterative algorithms?
Caching keeps the data in memory so iterative algorithms can reuse it without recomputing or reading from disk each time, making the process faster.
intermediate
What is the difference between cache() and persist() in Spark?
cache() is shorthand for persist() with the default storage level: MEMORY_AND_DISK for DataFrames/Datasets and MEMORY_ONLY for RDDs. persist() lets you choose a storage level explicitly, such as DISK_ONLY or MEMORY_ONLY_SER.
What happens when you call cache() on a DataFrame in Spark?
A. The DataFrame is stored in memory for faster access.
B. The DataFrame is deleted from memory.
C. The DataFrame is saved permanently to disk.
D. The DataFrame is converted to a different format.
Which storage level stores data only on disk in Spark persistence?
A. MEMORY_ONLY
B. OFF_HEAP
C. MEMORY_AND_DISK
D. DISK_ONLY
Why would you use persist() instead of cache()?
A. To convert the dataset to JSON format.
B. To delete the dataset from memory.
C. To choose a specific storage level other than the default.
D. To speed up the first computation only.
What is a benefit of caching in iterative machine learning algorithms?
A. It increases disk usage.
B. It reduces repeated data loading and computation.
C. It slows down the algorithm.
D. It deletes intermediate results.
If memory is limited, which storage level helps avoid out-of-memory errors?
A. MEMORY_AND_DISK
B. MEMORY_ONLY
C. DISK_ONLY
D. NONE
Explain caching and persistence in Apache Spark and how they help improve performance.
Think about how storing data in memory or disk helps avoid recomputation.
Describe the difference between cache() and persist() methods in Spark with examples of when to use each.
Consider how you might want to store data differently depending on memory availability.