Consider the following Apache Spark code snippet:
data = spark.range(1000)
cached_data = data.cache()
count1 = cached_data.count()
count2 = cached_data.count()
What is the value of count2?
data = spark.range(1000)
cached_data = data.cache()
count1 = cached_data.count()
count2 = cached_data.count()
print(count2)
Caching stores the data in memory after the first action.
The first count() triggers the computation and populates the cache. The second count() reads the cached partitions instead of recomputing the range, so count2 is 1000 and returns quickly.
Given this Spark code:
df = spark.range(10)
df.persist(StorageLevel.MEMORY_AND_DISK)
print(df.storageLevel)
What will be printed?
from pyspark import StorageLevel

df = spark.range(10)
df.persist(StorageLevel.MEMORY_AND_DISK)
print(df.storageLevel)
MEMORY_AND_DISK means data is stored in memory and spilled to disk if needed.
persist(StorageLevel.MEMORY_AND_DISK) keeps partitions in memory and spills them to disk when memory runs short. Note that in PySpark the data is always stored in serialized form, so the printed level reports Serialized (something like "Disk Memory Serialized 1x Replicated"); the deserialized in-memory format of MEMORY_AND_DISK applies only to the Scala/Java API.
In Spark, after calling df.unpersist(), the memory is not freed immediately. Why?
Think about how Spark manages memory and tasks asynchronously.
By default unpersist() is non-blocking: it marks the cached blocks for removal and frees the memory asynchronously, so tasks currently reading those blocks can finish first. Call unpersist(blocking=True) to wait until all blocks are actually removed.
You have a large DataFrame used in multiple iterations of a machine learning algorithm. Which persistence level is best to optimize performance and resource usage?
Consider serialization and fallback when memory is insufficient.
MEMORY_AND_DISK_SER stores partitions serialized in memory and spills them to disk when they do not fit, trading some (de)serialization CPU for a much smaller memory footprint. That makes it a good fit for a large DataFrame reused across iterations. In PySpark, data is always serialized, so StorageLevel.MEMORY_AND_DISK plays this role; the _SER variants belong to the Scala/Java API.
When you cache a DataFrame in Spark and the cluster memory is full, what is the expected behavior?
Think about Spark's memory management and eviction policies.
Spark evicts cached partitions in least-recently-used (LRU) order when memory fills up. An evicted partition is not lost: if it is needed again, Spark recomputes it from its lineage, or reads it back from disk if the storage level includes disk.