Apache Spark · ~20 mins

Caching and persistence in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
Problem 1: Predict Output (intermediate)
What is the output count after caching?

Consider the following Apache Spark code snippet:

data = spark.range(1000)
cached_data = data.cache()
count1 = cached_data.count()
count2 = cached_data.count()
print(count2)

What is the value of count2?

A) Raises an error
B) 0
C) None
D) 1000
💡 Hint

Caching stores the data in memory after the first action.
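The hint above can be sketched in plain Python (a toy model, not the PySpark API): cache() only *marks* the dataset for caching, the first action materializes and stores the data, and later actions reuse the stored result, so both counts return 1000.

```python
# Toy sketch of Spark's lazy cache semantics (plain Python, not PySpark).
class LazyRange:
    def __init__(self, n):
        self.n = n
        self._cached = None          # nothing materialized yet
        self._cache_enabled = False

    def cache(self):
        # cache() is lazy: it only marks the dataset for caching
        self._cache_enabled = True
        return self

    def count(self):
        # an action: materialize (or reuse) the data, then count it
        if self._cached is not None:
            data = self._cached                   # later actions hit the cache
        else:
            data = list(range(self.n))            # "compute" the data
            if self._cache_enabled:
                self._cached = data               # first action populates the cache
        return len(data)

data = LazyRange(1000)
cached_data = data.cache()
count1 = cached_data.count()   # computes and caches
count2 = cached_data.count()   # served from the cache
print(count2)                  # 1000
```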

Problem 2: Data Output (intermediate)
What is the storage level after persisting with MEMORY_AND_DISK?

Given this Spark code:

from pyspark import StorageLevel
df = spark.range(10)
df.persist(StorageLevel.MEMORY_AND_DISK)
print(df.storageLevel)

What will be printed?

A) StorageLevel(memory=True, disk=True, offHeap=False, deserialized=True, replication=1)
B) StorageLevel(memory=False, disk=True, offHeap=False, deserialized=True, replication=1)
C) StorageLevel(memory=True, disk=False, offHeap=False, deserialized=True, replication=1)
D) StorageLevel(memory=False, disk=False, offHeap=False, deserialized=False, replication=1)
💡 Hint

MEMORY_AND_DISK means data is stored in memory and spilled to disk if needed.
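To make the flags concrete, here is an illustrative pure-Python model of a storage level (field names follow the answer options above, not pyspark's real constructor): MEMORY_AND_DISK keeps both the memory and disk flags enabled, spilling partitions to disk only when they don't fit in memory.

```python
from dataclasses import dataclass

# Illustrative model of Spark storage-level flags (not the real pyspark class).
@dataclass(frozen=True)
class ToyStorageLevel:
    memory: bool
    disk: bool
    offHeap: bool = False
    deserialized: bool = True
    replication: int = 1

# MEMORY_AND_DISK: keep partitions in memory, spill to disk when they don't fit.
MEMORY_AND_DISK = ToyStorageLevel(memory=True, disk=True)
print(MEMORY_AND_DISK)
```

One caveat worth knowing: in PySpark specifically, cached Python data is stored serialized, so the deserialized flag reported by the real pyspark StorageLevel may differ from the JVM-side default shown in the options.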

Problem 3: 🔧 Debug (advanced)
Why does unpersisting not free memory immediately?

In Spark, after calling df.unpersist(), the memory is not freed immediately. Why?

A) Because unpersist() is asynchronous and frees memory lazily
B) Because unpersist() deletes the original data source
C) Because unpersist() caches data again automatically
D) Because unpersist() triggers a job that blocks memory release
💡 Hint

Think about how Spark manages memory and tasks asynchronously.
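A toy model of the idea (plain Python, not Spark's actual block manager): unpersist() marks cached blocks for removal and returns immediately; the actual release happens later. The real DataFrame.unpersist(blocking=...) parameter lets you wait for the removal instead.

```python
import threading

# Toy model (not PySpark): unpersist() marks blocks and returns immediately;
# a later sweep (Spark's block manager, here simplified) frees them.
class ToyCacheManager:
    def __init__(self):
        self.blocks = {"df-partition-0": b"x" * 1024}
        self._to_free = []
        self._lock = threading.Lock()

    def unpersist(self, name, blocking=False):
        with self._lock:
            self._to_free.append(name)   # only *marked* for removal
        if blocking:
            self._sweep()                # like unpersist(blocking=True)

    def _sweep(self):
        with self._lock:
            for name in self._to_free:
                self.blocks.pop(name, None)
            self._to_free.clear()

mgr = ToyCacheManager()
mgr.unpersist("df-partition-0")          # returns immediately
print("df-partition-0" in mgr.blocks)    # True: memory not freed yet
mgr._sweep()                             # later, the sweep actually frees it
print("df-partition-0" in mgr.blocks)    # False
```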

Problem 4: 🚀 Application (advanced)
Choosing the right persistence level for iterative algorithms

You have a large DataFrame used in multiple iterations of a machine learning algorithm. Which persistence level is best to optimize performance and resource usage?

A) StorageLevel.DISK_ONLY
B) StorageLevel.MEMORY_AND_DISK_SER
C) StorageLevel.MEMORY_ONLY
D) StorageLevel.OFF_HEAP
💡 Hint

Consider serialization and fallback when memory is insufficient.
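A rough illustration of why the serialized (_SER) levels save memory, using pickle as a stand-in for Spark's serializer: the byte counts below are approximations of Python object overhead, not Spark's exact accounting, but the gap they show is the point.

```python
import pickle
import sys

# Serialized bytes are usually much smaller than the equivalent
# deserialized Python objects, at the cost of CPU to deserialize on access.
rows = [(i, float(i) * 1.5) for i in range(10_000)]

# Approximate deserialized footprint: list + per-tuple + per-element overhead.
deserialized_bytes = sys.getsizeof(rows) + sum(
    sys.getsizeof(r) + sys.getsizeof(r[0]) + sys.getsizeof(r[1]) for r in rows
)
serialized_bytes = len(pickle.dumps(rows))

print(f"deserialized ~{deserialized_bytes} B, serialized {serialized_bytes} B")
```

This is the trade-off behind MEMORY_AND_DISK_SER for iterative workloads: a compact in-memory form fits more partitions, and anything that still doesn't fit falls back to disk instead of being dropped.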

Problem 5: 🧠 Conceptual (expert)
What happens if you cache a DataFrame but the cluster runs out of memory?

When you cache a DataFrame in Spark and the cluster memory is full, what is the expected behavior?

A) Spark will automatically increase cluster memory to fit the cache
B) Spark will crash immediately due to out-of-memory error
C) Spark will evict cached partitions using LRU policy and recompute them when needed
D) Spark will convert the DataFrame to disk-only storage without eviction
💡 Hint

Think about Spark's memory management and eviction policies.
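LRU eviction can be sketched with an OrderedDict (a toy model, not Spark's actual MemoryStore): when the store goes over capacity, the least-recently-used partition is dropped, and a later access recomputes it from lineage.

```python
from collections import OrderedDict

# Toy sketch of LRU partition eviction (not Spark's real MemoryStore).
class ToyMemoryStore:
    def __init__(self, capacity):
        self.capacity = capacity
        self.partitions = OrderedDict()   # order tracks recency of use
        self.evicted = []

    def get(self, pid, compute):
        if pid in self.partitions:
            self.partitions.move_to_end(pid)       # cache hit: mark as recent
            return self.partitions[pid]
        value = compute(pid)                       # miss: recompute from lineage
        self.partitions[pid] = value
        if len(self.partitions) > self.capacity:   # over budget: evict LRU entry
            old, _ = self.partitions.popitem(last=False)
            self.evicted.append(old)
        return value

store = ToyMemoryStore(capacity=2)
compute = lambda pid: f"partition-{pid}-data"
store.get(0, compute)
store.get(1, compute)
store.get(2, compute)        # evicts partition 0 (least recently used)
print(store.evicted)         # [0]
store.get(0, compute)        # recomputed on demand, evicting partition 1
print(store.evicted)         # [0, 1]
```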