
How to Persist RDD in Spark: Syntax and Examples

To persist an RDD in Spark, call cache() to keep it in memory, or call persist() with a storage level to choose memory, disk, or both. A persisted RDD is reused across multiple actions instead of being recomputed from its lineage each time.

Syntax

Use rdd.cache() to store the RDD in memory by default. Use rdd.persist(storageLevel) to specify how and where to store the RDD, such as memory, disk, or both.

  • rdd: Your existing RDD
  • cache(): Shorthand for persist(StorageLevel.MEMORY_ONLY)
  • persist(storageLevel): Persist with custom storage level
  • StorageLevel: Options like MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY
scala
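import org.apache.spark.storage.StorageLevel

rdd.cache()                                // Option 1: memory only
rdd.persist(StorageLevel.MEMORY_AND_DISK)  // Option 2: use instead of cache(); a level cannot be changed once assigned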
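
As a quick check of what each call does, the sketch below reads back the storage level with getStorageLevel. The object name StorageLevelCheck and the local[*] master are illustrative assumptions, not part of the original article.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical demo object; the name and the local master are assumptions
object StorageLevelCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("StorageLevelCheck").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val cached = sc.parallelize(1 to 10)
    // Before cache()/persist(), no storage level is assigned
    println(cached.getStorageLevel == StorageLevel.NONE)

    cached.cache()
    // cache() assigns the same level as persist(StorageLevel.MEMORY_ONLY)
    println(cached.getStorageLevel == StorageLevel.MEMORY_ONLY)

    val spilled = sc.parallelize(1 to 10).persist(StorageLevel.MEMORY_AND_DISK)
    // MEMORY_AND_DISK is allowed to use both memory and disk
    println(spilled.getStorageLevel.useMemory && spilled.getStorageLevel.useDisk)

    cached.unpersist()
    spilled.unpersist()
    spark.stop()
  }
}
```

All three checks should print true. Note that the two levels are set on two different RDDs, because Spark rejects an attempt to assign a second, different level to an RDD that already has one.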

Example

This example shows how to create an RDD, persist it in memory and disk, and perform actions to demonstrate persistence.

scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistRDDExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("PersistRDDExample").master("local").getOrCreate()
    val sc = spark.sparkContext

    val data = Seq(1, 2, 3, 4, 5)
    val rdd = sc.parallelize(data)

    // Persist RDD in memory and disk
    rdd.persist(StorageLevel.MEMORY_AND_DISK)

    // Perform an action to trigger computation and caching
    println("Sum: " + rdd.sum())

    // Perform another action to show it uses persisted data
    println("Count: " + rdd.count())

    // Unpersist when done
    rdd.unpersist()

    spark.stop()
  }
}
Output
Sum: 15.0
Count: 5
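
The output alone does not show whether the second action reused stored data. One way to observe it, sketched here under the assumption of a local run with no task retries, is to count how often a map function runs with and without cache(). The object name RecomputeDemo and the accumulator name mapCalls are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical demo object; names are assumptions made for illustration
object RecomputeDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("RecomputeDemo").master("local").getOrCreate()
    val sc = spark.sparkContext

    // Accumulator that counts how many times the map function runs
    val mapCalls = sc.longAccumulator("mapCalls")

    // Not persisted: each action re-runs the map over all 100 elements
    val plain = sc.parallelize(1 to 100).map { x => mapCalls.add(1); x * 2 }
    plain.count()
    plain.count()
    println("map calls without persist: " + mapCalls.value)

    mapCalls.reset()

    // Persisted: the first action computes and stores the partitions,
    // the second action reads them back instead of re-running the map
    val persisted = sc.parallelize(1 to 100).map { x => mapCalls.add(1); x * 2 }.cache()
    persisted.count()
    persisted.count()
    println("map calls with persist: " + mapCalls.value)

    persisted.unpersist()
    spark.stop()
  }
}
```

In such a local run the first counter should come out at 200 and the second at 100. Accumulator updates made inside transformations can be applied more than once if tasks are retried, so treat this as a teaching aid rather than an exact measurement on a real cluster.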

Common Pitfalls

One common mistake is calling persist() or cache() but not triggering an action, so the RDD is not actually cached. Another is forgetting to unpersist() when the RDD is no longer needed, which wastes memory.

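Also, cache() stores data only in memory. Partitions that do not fit are simply not cached and are recomputed each time they are needed, which can quietly cancel the benefit of caching. Use persist(StorageLevel.MEMORY_AND_DISK) to spill those partitions to disk instead.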

scala
val rdd = sc.parallelize(1 to 1000)
rdd.cache() // No action called, so no caching happens

// Correct way:
rdd.cache()
rdd.count() // Action triggers caching

// Remember to unpersist when done:
rdd.unpersist()
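
To catch RDDs that were never unpersisted, SparkContext exposes getPersistentRDDs, a map from RDD id to every RDD currently marked as persistent. The sketch below uses it to confirm that unpersist() really released the RDD. The object name PersistedRddAudit is an illustrative assumption.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical demo object; the name is an assumption made for illustration
object PersistedRddAudit {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("PersistedRddAudit").master("local").getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(1 to 1000).cache()
    rdd.count() // action materializes the cached partitions

    // The RDD is registered as persistent under its id
    println(sc.getPersistentRDDs.contains(rdd.id))

    // blocking = true waits until the cached blocks are removed
    rdd.unpersist(blocking = true)

    // After unpersist() the RDD is no longer registered and has no storage level
    println(sc.getPersistentRDDs.contains(rdd.id))
    println(rdd.getStorageLevel == StorageLevel.NONE)

    spark.stop()
  }
}
```

This should print true, false, true. Printing sc.getPersistentRDDs.keys near the end of a job is a simple way to spot RDDs that are still holding storage.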

Quick Reference

| Method | Description |
|---|---|
| cache() | Persist RDD in memory only (same as MEMORY_ONLY) |
| persist(StorageLevel) | Persist RDD with the specified storage level |
| unpersist() | Remove RDD from persistence storage |
| StorageLevel.MEMORY_ONLY | Store RDD only in memory; partitions that do not fit are recomputed |
| StorageLevel.MEMORY_AND_DISK | Store RDD in memory and spill partitions that do not fit to disk |
| StorageLevel.DISK_ONLY | Store RDD only on disk |

Key Takeaways

  • Use cache() to persist an RDD in memory for faster reuse.
  • Use persist() with a StorageLevel to control where the RDD is stored.
  • Always trigger an action after persist() or cache() to materialize the RDD.
  • Call unpersist() to free resources when the RDD is no longer needed.
  • Choose the MEMORY_AND_DISK storage level when the RDD may not fit in memory, so that overflow partitions are read from disk instead of being recomputed.