Apache Spark · ~10 mins

Caching and persistence in Apache Spark - Step-by-Step Execution

Concept Flow - Caching and persistence
1. Create RDD/DataFrame
2. Perform Transformations
3. Cache or Persist?
   - No: recompute on each action
   - Yes: store in memory/disk
4. Trigger Action (e.g., count)
5. Reuse Cached Data
6. Faster Subsequent Actions
7. Optionally Unpersist to Free Memory
This flow shows how Spark caches or persists data after transformations to speed up repeated actions by storing data in memory or disk.
Execution Sample
Apache Spark
from pyspark import SparkContext

sc = SparkContext("local", "caching-demo")  # set up a local SparkContext
rdd = sc.parallelize([1, 2, 3, 4])
rdd2 = rdd.map(lambda x: x * 2)  # transformation: defined lazily, not computed yet
rdd2.cache()                     # mark for caching; still nothing computed
print(rdd2.count())              # action: computes, caches, prints 4
print(rdd2.collect())            # reuses the cached data: [2, 4, 6, 8]
This code creates an RDD, doubles each element, caches the result, then counts and collects the cached data.
Execution Table
Step | Action | Evaluation | Result
1 | Create RDD with [1,2,3,4] | RDD created | [1,2,3,4]
2 | Apply map to double each element | Transformation defined | RDD with [2,4,6,8] (not computed yet)
3 | Call cache() | Mark RDD for caching | RDD marked cached (lazy)
4 | Call count() action | Triggers computation and caching | Count = 4; RDD cached in memory
5 | Call collect() action | Uses cached data | [2,4,6,8] returned quickly
💡 All actions complete; cached data reused for collect after count
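The lazy behaviour traced in the table can be mimicked in plain Python, with no Spark installation required. This is a sketch only: `LazyList` and `compute_calls` are illustrative names invented for the analogy, not Spark APIs. A counter records how often the data is actually materialized, showing that cache() alone computes nothing and that the second action reuses the first action's result.

```python
# Plain-Python analogy of Spark's lazy caching (illustrative, not the Spark API).
compute_calls = 0

class LazyList:
    def __init__(self, source, fn=lambda x: x):
        self.source, self.fn = source, fn
        self.cached = None           # nothing stored until an action runs
        self.caching_enabled = False

    def map(self, fn):               # transformation: just records the function
        return LazyList(self.source, lambda x, f=self.fn: fn(f(x)))

    def cache(self):                 # mark for caching; no computation happens
        self.caching_enabled = True
        return self

    def _materialize(self):          # runs only when an action needs the data
        global compute_calls
        if self.cached is not None:
            return self.cached       # reuse the cached result
        compute_calls += 1
        result = [self.fn(x) for x in self.source]
        if self.caching_enabled:
            self.cached = result
        return result

    def count(self):                 # action: triggers computation
        return len(self._materialize())

    def collect(self):               # action: reuses the cache if present
        return self._materialize()

rdd2 = LazyList([1, 2, 3, 4]).map(lambda x: x * 2).cache()
print(compute_calls)    # 0 -> cache() alone computed nothing
print(rdd2.count())     # 4 -> first action materializes and caches
print(rdd2.collect())   # [2, 4, 6, 8] -> served from the cache
print(compute_calls)    # 1 -> the data was computed only once
```

The counter staying at 1 after both actions mirrors steps 4 and 5 of the execution table: count() pays the computation cost once, and collect() rides on the cached copy.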
Variable Tracker
Variable | Start | After Step 2 | After Step 3 | After Step 4 | After Step 5
rdd | [1,2,3,4] | [1,2,3,4] | [1,2,3,4] | [1,2,3,4] | [1,2,3,4]
rdd2 | undefined | RDD with [2,4,6,8] (lazy) | Marked cached | Cached in memory | Cached in memory
Key Moments - 2 Insights
Why doesn't calling cache() immediately compute the RDD?
cache() only marks the RDD to be cached; the actual computation happens when an action like count() triggers it, as shown in step 4 of the execution table.
How does caching speed up the collect() action after count()?
Because the RDD was cached in memory during count(), collect() reuses this cached data instead of recomputing, making it faster (step 5).
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, what is the result of the count() action at step 4?
A) 4
B) 8
C) [2,4,6,8]
D) RDD marked cached
💡 Hint
Check the 'Result' column at step 4 in the execution table.
At which step is the RDD actually cached in memory?
A) Step 3
B) Step 4
C) Step 2
D) Step 5
💡 Hint
Look for when caching is triggered by an action in the execution table.
If we remove the cache() call, what would happen at step 5 when collect() is called?
A) Collect would use cached data
B) Count would fail
C) Collect would recompute the RDD
D) RDD would be cached automatically
💡 Hint
Refer to the role of cache() in the key moments and execution table.
Concept Snapshot
Caching and persistence in Spark:
- cache() marks data to store in memory
- persist() can store in memory/disk with levels
- Actual caching happens on action (e.g., count)
- Speeds up repeated actions by reusing data
- Use unpersist() to free memory when done
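The unpersist() point in the snapshot can also be sketched in plain Python. The `CachedComputation` class below is an illustrative analogy, not a Spark API: dropping the stored copy means the next action has to recompute, which is exactly the trade-off of calling unpersist() too early.

```python
# Sketch of unpersist() semantics via a simple memoized computation
# (illustrative analogy; not the Spark API).
class CachedComputation:
    def __init__(self, fn):
        self.fn = fn
        self._cache = None
        self.compute_count = 0     # how often we actually compute

    def get(self):                 # like running an action on a cached RDD
        if self._cache is None:
            self.compute_count += 1
            self._cache = self.fn()
        return self._cache

    def unpersist(self):           # free the stored copy
        self._cache = None

doubled = CachedComputation(lambda: [x * 2 for x in [1, 2, 3, 4]])
doubled.get()                  # computes and caches
doubled.get()                  # reuses the cache
print(doubled.compute_count)   # 1
doubled.unpersist()            # cache dropped, memory freed
doubled.get()                  # recomputed from scratch
print(doubled.compute_count)   # 2
```

In real Spark, unpersist() is worthwhile once a dataset is no longer reused, since executor memory freed this way becomes available to other cached data and shuffles.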
Full Transcript
In Apache Spark, caching and persistence help speed up repeated computations. When you create an RDD or DataFrame and apply transformations, the data is not computed immediately. Calling cache() or persist() marks the data to be stored in memory or disk after an action triggers computation. For example, calling count() computes and caches the data. Subsequent actions like collect() reuse this cached data, making them faster. This process avoids recomputing the same data multiple times. You can free memory by calling unpersist() when caching is no longer needed.