How to Persist RDD in Spark: Syntax and Examples
To persist an RDD in Spark, use the cache() method to store it in memory, or persist() to choose a storage level such as memory and disk. Persisting lets you reuse the RDD across multiple actions without recomputing it each time.
Syntax
Use rdd.cache() to store the RDD in memory by default. Use rdd.persist(storageLevel) to specify how and where to store the RDD, such as memory, disk, or both.
rdd: Your existing RDD
cache(): Shortcut to persist in memory only
persist(storageLevel): Persist with a custom storage level
StorageLevel: Options like MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY
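```scala
import org.apache.spark.storage.StorageLevel

// Use one or the other on a given RDD, not both
rdd.cache()                                // same as persist(StorageLevel.MEMORY_ONLY)
rdd.persist(StorageLevel.MEMORY_AND_DISK)  // explicit storage level
```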
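To confirm which storage level each call assigns, you can read it back with getStorageLevel. The sketch below is illustrative only: the object name StorageLevelCheck and the helper assignedLevels are hypothetical, and it assumes a local Spark session.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical helper object, not part of Spark
object StorageLevelCheck {
  // Returns the storage levels assigned by cache() and by an explicit persist()
  def assignedLevels(spark: SparkSession): (StorageLevel, StorageLevel) = {
    val sc = spark.sparkContext
    val cached = sc.parallelize(1 to 5).cache()
    val persisted = sc.parallelize(1 to 5).persist(StorageLevel.MEMORY_AND_DISK)
    val levels = (cached.getStorageLevel, persisted.getStorageLevel)
    cached.unpersist()
    persisted.unpersist()
    levels
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("StorageLevelCheck").master("local").getOrCreate()
    val (cacheLevel, persistLevel) = assignedLevels(spark)
    println(cacheLevel == StorageLevel.MEMORY_ONLY)       // true
    println(persistLevel == StorageLevel.MEMORY_AND_DISK) // true
    spark.stop()
  }
}
```

The level is recorded as soon as cache() or persist() is called, even though no data is stored until an action runs.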
Example
This example shows how to create an RDD, persist it in memory and disk, and perform actions to demonstrate persistence.
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistRDDExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("PersistRDDExample").master("local").getOrCreate()
    val sc = spark.sparkContext

    val data = Seq(1, 2, 3, 4, 5)
    val rdd = sc.parallelize(data)

    // Persist RDD in memory and disk
    rdd.persist(StorageLevel.MEMORY_AND_DISK)

    // Perform an action to trigger computation and caching
    println("Sum: " + rdd.sum())

    // Perform another action to show it uses persisted data
    println("Count: " + rdd.count())

    // Unpersist when done
    rdd.unpersist()

    spark.stop()
  }
}
```
Output
Sum: 15.0
Count: 5
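The printed results look the same with or without persistence, so they do not show that the second action reused stored data. One way to observe the reuse is to count how often a transformation runs, using an accumulator. This is a minimal sketch under assumptions: PersistReuseCheck and evaluationCounts are hypothetical names, and the counts hold for a local run with no task retries.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical helper object, not part of Spark
object PersistReuseCheck {
  // Returns how many times the map function has run:
  // before any action, after the first action, and after the second action
  def evaluationCounts(spark: SparkSession): (Long, Long, Long) = {
    val sc = spark.sparkContext
    val evaluations = sc.longAccumulator("evaluations")

    val squared = sc.parallelize(1 to 5).map { x =>
      evaluations.add(1) // incremented once per element computed
      x * x
    }
    squared.persist(StorageLevel.MEMORY_AND_DISK)

    val beforeAction: Long = evaluations.value // persist() alone computes nothing
    squared.count()                            // first action computes and stores the RDD
    val afterFirst: Long = evaluations.value
    squared.sum()                              // second action reads the stored partitions
    val afterSecond: Long = evaluations.value

    squared.unpersist()
    (beforeAction, afterFirst, afterSecond)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("PersistReuseCheck").master("local").getOrCreate()
    println(evaluationCounts(spark)) // (0,5,5)
    spark.stop()
  }
}
```

Without the persist() call, the second action would recompute the map and the last count would be 10 instead of 5.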
Common Pitfalls
One common mistake is calling persist() or cache() without triggering an action. Both calls are lazy, so nothing is stored until an action computes the RDD. Another is forgetting to call unpersist() when the RDD is no longer needed, which wastes memory.
Also, cache() stores the RDD in memory only. Partitions that do not fit are not cached and are recomputed each time they are needed, which can cancel out the benefit of caching. Use persist(StorageLevel.MEMORY_AND_DISK) to spill those partitions to disk instead.
```scala
val rdd = sc.parallelize(1 to 1000)
rdd.cache()
// No action called, so no caching happens

// Correct way:
rdd.cache()
rdd.count() // Action triggers caching

// Remember to unpersist when done:
rdd.unpersist()
```
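A related point when switching an RDD from cache() to MEMORY_AND_DISK: Spark rejects a persist() call that asks for a different level on an RDD that already has one, so the RDD has to be unpersisted first. The sketch below illustrates this under assumptions: ChangeStorageLevel and switchLevel are hypothetical names, and it assumes a local Spark session.

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical helper object, not part of Spark
object ChangeStorageLevel {
  // Returns whether the direct level change was rejected,
  // and the level in effect after unpersist() followed by persist()
  def switchLevel(spark: SparkSession): (Boolean, StorageLevel) = {
    val sc = spark.sparkContext
    val rdd = sc.parallelize(1 to 1000)
    rdd.cache() // assigns MEMORY_ONLY

    // Asking for a different level on an already-persisted RDD throws
    val rejected = Try(rdd.persist(StorageLevel.MEMORY_AND_DISK)).isFailure

    // Clear the old level first, then assign the new one
    rdd.unpersist()
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    val level = rdd.getStorageLevel

    rdd.unpersist()
    (rejected, level)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ChangeStorageLevel").master("local").getOrCreate()
    val (rejected, level) = switchLevel(spark)
    println(rejected)                              // true
    println(level == StorageLevel.MEMORY_AND_DISK) // true
    spark.stop()
  }
}
```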
Quick Reference
| Method | Description |
|---|---|
| cache() | Persist RDD in memory only (default) |
| persist(StorageLevel) | Persist RDD with specified storage level |
| unpersist() | Remove RDD from persistence storage |
| StorageLevel.MEMORY_ONLY | Store RDD only in memory |
| StorageLevel.MEMORY_AND_DISK | Store RDD in memory; spill partitions that do not fit to disk |
| StorageLevel.DISK_ONLY | Store RDD only on disk |
Key Takeaways
Use cache() to persist an RDD in memory for faster reuse.
Use persist() with a StorageLevel to control where the RDD is stored.
Always trigger an action after persist() or cache() to materialize the RDD.
Call unpersist() to free resources when the RDD is no longer needed.
Choose the MEMORY_AND_DISK storage level so partitions that do not fit in memory are spilled to disk instead of being recomputed.