Apache Spark · ~10 mins

Caching and persistence in Apache Spark - Step-by-Step Execution

Concept Flow - Caching and persistence
1. Create RDD/DataFrame
2. Perform Transformations
3. Cache or Persist?
   - No: recompute on each action
   - Yes: store in memory/disk
4. Trigger Action (e.g., count)
5. Reuse Cached Data
6. Faster Subsequent Actions
7. Optionally Unpersist to Free Memory
This flow shows how Spark caches or persists data after transformations to speed up repeated actions by storing data in memory or disk.
Execution Sample
Apache Spark
from pyspark import SparkContext

sc = SparkContext("local", "caching-demo")  # set up a local SparkContext
rdd = sc.parallelize([1, 2, 3, 4])
rdd2 = rdd.map(lambda x: x * 2)  # transformation: defined lazily, not computed yet
rdd2.cache()                     # mark for caching; still nothing computed
print(rdd2.count())              # action: computes, caches, prints 4
print(rdd2.collect())            # reuses the cached data: [2, 4, 6, 8]
This code creates an RDD, doubles each element, caches the result, then counts and collects the cached data.
Execution Table
Step | Action | Evaluation | Result
1 | Create RDD with [1,2,3,4] | RDD created | [1,2,3,4]
2 | Apply map to double each element | Transformation defined | RDD with [2,4,6,8] (not computed yet)
3 | Call cache() | Mark RDD for caching | RDD marked cached (lazy)
4 | Call count() action | Triggers computation and caching | Count = 4; RDD cached in memory
5 | Call collect() action | Uses cached data | [2,4,6,8] returned quickly
💡 All actions complete; cached data reused for collect after count
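The lazy behaviour traced in the table can be mimicked in plain Python, with no Spark installation required. This is a sketch only: `LazyList` and `compute_calls` are illustrative names invented for the analogy, not Spark APIs. A counter records how often the data is actually materialized, showing that cache() alone computes nothing and that the second action reuses the first action's result.

```python
# Plain-Python analogy of Spark's lazy caching (illustrative, not the Spark API).
compute_calls = 0

class LazyList:
    def __init__(self, source, fn=lambda x: x):
        self.source, self.fn = source, fn
        self.cached = None           # nothing stored until an action runs
        self.caching_enabled = False

    def map(self, fn):               # transformation: just records the function
        return LazyList(self.source, lambda x, f=self.fn: fn(f(x)))

    def cache(self):                 # mark for caching; no computation happens
        self.caching_enabled = True
        return self

    def _materialize(self):          # runs only when an action needs the data
        global compute_calls
        if self.cached is not None:
            return self.cached       # reuse the cached result
        compute_calls += 1
        result = [self.fn(x) for x in self.source]
        if self.caching_enabled:
            self.cached = result
        return result

    def count(self):                 # action: triggers computation
        return len(self._materialize())

    def collect(self):               # action: reuses the cache if present
        return self._materialize()

rdd2 = LazyList([1, 2, 3, 4]).map(lambda x: x * 2).cache()
print(compute_calls)    # 0 -> cache() alone computed nothing
print(rdd2.count())     # 4 -> first action materializes and caches
print(rdd2.collect())   # [2, 4, 6, 8] -> served from the cache
print(compute_calls)    # 1 -> the data was computed only once
```

The counter staying at 1 after both actions mirrors steps 4 and 5 of the execution table: count() pays the computation cost once, and collect() rides on the cached copy.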
Variable Tracker
Variable | Start | After Step 2 | After Step 3 | After Step 4 | After Step 5
rdd | [1,2,3,4] | [1,2,3,4] | [1,2,3,4] | [1,2,3,4] | [1,2,3,4]
rdd2 | undefined | RDD with [2,4,6,8] (lazy) | Marked cached | Cached in memory | Cached in memory
Key Moments - 2 Insights
Why doesn't calling cache() immediately compute the RDD?
cache() only marks the RDD to be cached; the actual computation happens when an action like count() triggers it, as shown in step 4 of the execution table.
How does caching speed up the collect() action after count()?
Because the RDD was cached in memory during count(), collect() reuses this cached data instead of recomputing, making it faster (step 5).
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, what is the result of the count() action at step 4?
A) 4
B) 8
C) [2,4,6,8]
D) RDD marked cached
💡 Hint
Check the 'Result' column at step 4 in the execution table.
At which step is the RDD actually cached in memory?
A) Step 3
B) Step 4
C) Step 2
D) Step 5
💡 Hint
Look for when caching is triggered by an action in the execution table.
If we remove the cache() call, what would happen at step 5 when collect() is called?
A) Collect would use cached data
B) Count would fail
C) Collect would recompute the RDD
D) RDD would be cached automatically
💡 Hint
Refer to the role of cache() in the key moments and execution table.
Concept Snapshot
Caching and persistence in Spark:
- cache() marks data to store in memory
- persist() can store in memory/disk with levels
- Actual caching happens on action (e.g., count)
- Speeds up repeated actions by reusing data
- Use unpersist() to free memory when done
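The unpersist() point in the snapshot can also be sketched in plain Python. The `CachedComputation` class below is an illustrative analogy, not a Spark API: dropping the stored copy means the next action has to recompute, which is exactly the trade-off of calling unpersist() too early.

```python
# Sketch of unpersist() semantics via a simple memoized computation
# (illustrative analogy; not the Spark API).
class CachedComputation:
    def __init__(self, fn):
        self.fn = fn
        self._cache = None
        self.compute_count = 0     # how often we actually compute

    def get(self):                 # like running an action on a cached RDD
        if self._cache is None:
            self.compute_count += 1
            self._cache = self.fn()
        return self._cache

    def unpersist(self):           # free the stored copy
        self._cache = None

doubled = CachedComputation(lambda: [x * 2 for x in [1, 2, 3, 4]])
doubled.get()                  # computes and caches
doubled.get()                  # reuses the cache
print(doubled.compute_count)   # 1
doubled.unpersist()            # cache dropped, memory freed
doubled.get()                  # recomputed from scratch
print(doubled.compute_count)   # 2
```

In real Spark, unpersist() is worthwhile once a dataset is no longer reused, since executor memory freed this way becomes available to other cached data and shuffles.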
Full Transcript
In Apache Spark, caching and persistence help speed up repeated computations. When you create an RDD or DataFrame and apply transformations, the data is not computed immediately. Calling cache() or persist() marks the data to be stored in memory or disk after an action triggers computation. For example, calling count() computes and caches the data. Subsequent actions like collect() reuse this cached data, making them faster. This process avoids recomputing the same data multiple times. You can free memory by calling unpersist() when caching is no longer needed.