Recall & Review
beginner
What is caching in Apache Spark?
Caching in Apache Spark means storing a dataset in memory to speed up repeated access. It helps avoid recomputing the dataset every time it is used.
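To make the "avoid recomputing" idea concrete, here is a minimal pure-Python sketch. It is a toy model, not Spark's API: `LazyDataset`, `compute_calls`, and `squares` are hypothetical names invented for this illustration. It only mimics two behaviors: an uncached dataset is rebuilt on every action, while a cached one is built once, on the first action after cache() is called.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute        # function that rebuilds the rows from scratch
        self._cached_rows = None       # materialized rows, once caching has happened
        self._cache_requested = False
        self.compute_calls = 0         # how many times the rows were rebuilt

    def cache(self):
        # Only marks the dataset for caching; nothing is computed yet.
        self._cache_requested = True
        return self

    def collect(self):
        # An "action": returns the rows, reusing the cached copy when one exists.
        if self._cached_rows is not None:
            return self._cached_rows
        self.compute_calls += 1
        rows = list(self._compute())
        if self._cache_requested:
            self._cached_rows = rows
        return rows


def squares():
    # Stands in for an expensive chain of transformations.
    return (x * x for x in range(5))


uncached = LazyDataset(squares)
for _ in range(3):
    uncached.collect()

cached = LazyDataset(squares).cache()
for _ in range(3):
    cached.collect()

print(uncached.compute_calls, cached.compute_calls)  # prints "3 1"
```

The uncached dataset is rebuilt three times, the cached one only once. That saving is what caching buys when the same dataset is used repeatedly.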
beginner
What does persistence mean in Spark?
Persistence means saving a dataset to memory and/or disk with different storage levels. It allows Spark to reuse the dataset efficiently across multiple operations.
intermediate
Name two common storage levels used in Spark persistence.
Two common storage levels are MEMORY_ONLY (keep partitions only in memory; partitions that do not fit are not cached and are recomputed when needed) and MEMORY_AND_DISK (keep partitions in memory and spill those that do not fit to disk).
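The difference between storage levels can be sketched with a simplified placement model. This is an assumption-laden illustration, not how Spark's block manager is implemented: `place_partitions`, `partition_sizes`, and `memory_budget` are hypothetical names, and real Spark also handles eviction, serialization, and replication.

```python
def place_partitions(partition_sizes, memory_budget, level):
    """Toy model: decide where each partition ends up for a storage level."""
    placements = []
    used = 0  # memory consumed so far, in the same units as the sizes
    for size in partition_sizes:
        if level != "DISK_ONLY" and used + size <= memory_budget:
            used += size
            placements.append("memory")
        elif level in ("MEMORY_AND_DISK", "DISK_ONLY"):
            placements.append("disk")       # spilled or stored on disk
        else:
            placements.append("recompute")  # MEMORY_ONLY: not cached at all
    return placements


sizes = [40, 40, 40]  # three partitions, hypothetical sizes

print(place_partitions(sizes, 100, "MEMORY_ONLY"))      # ['memory', 'memory', 'recompute']
print(place_partitions(sizes, 100, "MEMORY_AND_DISK"))  # ['memory', 'memory', 'disk']
print(place_partitions(sizes, 100, "DISK_ONLY"))        # ['disk', 'disk', 'disk']
```

In this model the third partition does not fit in memory. Under MEMORY_ONLY it has to be recomputed on each use, while under MEMORY_AND_DISK it is read back from disk.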
intermediate
How does caching improve performance in iterative algorithms?
Caching keeps the data in memory so iterative algorithms can reuse it without recomputing or reading from disk each time, making the process faster.
intermediate
What is the difference between cache() and persist() in Spark?
cache() is shorthand for persist() with the default storage level, which is MEMORY_AND_DISK for DataFrames and Datasets and MEMORY_ONLY for RDDs. persist() lets you choose other storage levels like DISK_ONLY or MEMORY_ONLY_SER.
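The relationship between the two methods can be pictured with a small toy class. This is a pure-Python sketch, not Spark code: `ToyDataset`, `DEFAULT_LEVEL`, and the `kind` argument are hypothetical names used only for illustration, and the storage levels are plain strings rather than Spark's StorageLevel objects.

```python
# Default level per API in this toy model (mirrors Spark's documented defaults).
DEFAULT_LEVEL = {"rdd": "MEMORY_ONLY", "dataframe": "MEMORY_AND_DISK"}


class ToyDataset:
    """Toy stand-in showing cache() as persist() with a default level."""

    def __init__(self, kind):
        self.kind = kind            # "rdd" or "dataframe"
        self.storage_level = None   # None means "not persisted"

    def persist(self, level=None):
        # Use the caller's level if given, otherwise fall back to the default.
        self.storage_level = level or DEFAULT_LEVEL[self.kind]
        return self

    def cache(self):
        # cache() takes no level argument; it simply delegates to persist().
        return self.persist()


df = ToyDataset("dataframe").cache()
rdd = ToyDataset("rdd").cache()
on_disk = ToyDataset("dataframe").persist("DISK_ONLY")

print(df.storage_level, rdd.storage_level, on_disk.storage_level)
# prints "MEMORY_AND_DISK MEMORY_ONLY DISK_ONLY"
```

Reaching for persist() makes sense when the default is not a good fit, for example choosing DISK_ONLY for a large dataset that should not compete for memory.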
What happens when you call cache() on a DataFrame in Spark?
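cache() marks the DataFrame for storage at the default level (MEMORY_AND_DISK for DataFrames). The data is materialized the first time an action runs, and later operations reuse it instead of recomputing it.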
Which storage level stores data only on disk in Spark persistence?
DISK_ONLY stores the dataset only on disk, not in memory.
Why would you use persist() instead of cache()?
persist() allows selecting different storage levels like MEMORY_ONLY or DISK_ONLY.
What is a benefit of caching in iterative machine learning algorithms?
Caching avoids recomputing data in each iteration, speeding up the process.
If memory is limited, which storage level helps avoid out-of-memory errors?
MEMORY_AND_DISK keeps partitions in memory and spills those that do not fit to disk, so the cached dataset does not have to fit entirely in memory.
Explain caching and persistence in Apache Spark and how they help improve performance.
Think about how storing data in memory or disk helps avoid recomputation.
Describe the difference between cache() and persist() methods in Spark with examples of when to use each.
Consider how you might want to store data differently depending on memory availability.