
Caching and persistence in Apache Spark - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is caching in Apache Spark?
Caching in Apache Spark means storing a dataset in memory to speed up repeated access. It helps avoid recomputing the dataset every time it is used.
beginner
What does persistence mean in Spark?
Persistence means saving a dataset to memory and/or disk with different storage levels. It allows Spark to reuse the dataset efficiently across multiple operations.
intermediate
Name two common storage levels used in Spark persistence.
Two common storage levels are MEMORY_ONLY (store data only in memory) and MEMORY_AND_DISK (store data in memory, spill to disk if not enough memory).
intermediate
How does caching improve performance in iterative algorithms?
Caching keeps the data in memory so iterative algorithms can reuse it without recomputing or reading from disk each time, making the process faster.
intermediate
What is the difference between cache() and persist() in Spark?
cache() is shorthand for persist() with the default storage level: MEMORY_AND_DISK for DataFrames/Datasets and MEMORY_ONLY for RDDs. persist() lets you choose a storage level explicitly, such as DISK_ONLY or MEMORY_ONLY_SER.
What happens when you call cache() on a DataFrame in Spark?
A. The DataFrame is stored in memory for faster access.
B. The DataFrame is deleted from memory.
C. The DataFrame is saved permanently to disk.
D. The DataFrame is converted to a different format.
Which storage level stores data only on disk in Spark persistence?
A. MEMORY_ONLY
B. OFF_HEAP
C. MEMORY_AND_DISK
D. DISK_ONLY
Why would you use persist() instead of cache()?
A. To convert the dataset to JSON format.
B. To delete the dataset from memory.
C. To choose a specific storage level other than the default.
D. To speed up the first computation only.
What is a benefit of caching in iterative machine learning algorithms?
A. It increases disk usage.
B. It reduces repeated data loading and computation.
C. It slows down the algorithm.
D. It deletes intermediate results.
If memory is limited, which storage level helps avoid out-of-memory errors?
A. MEMORY_AND_DISK
B. MEMORY_ONLY
C. DISK_ONLY
D. NONE
Explain caching and persistence in Apache Spark and how they help improve performance.
Think about how storing data in memory or disk helps avoid recomputation.
Describe the difference between cache() and persist() methods in Spark with examples of when to use each.
Consider how you might want to store data differently depending on memory availability.