Caching and persistence keep data ready in memory or on disk so Spark can reuse it instead of recomputing it.
Caching and persistence in Apache Spark
Introduction
Caching or persistence is worth considering in situations like these:
When you run the same data operations multiple times and want to save time.
When your data is big but fits in memory, so you want faster access.
When you want to avoid repeating slow data loading or calculations.
When you want to keep intermediate results during complex data processing.
When you want to balance speed and storage by choosing memory or disk.
Syntax
Apache Spark
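from pyspark import StorageLevel
dataframe.cache()
dataframe.persist(StorageLevel.MEMORY_AND_DISK)
# To remove cached data:
dataframe.unpersist()
For DataFrames, cache() is shorthand for persist() with the default storage level, which is a memory-and-disk level (MEMORY_AND_DISK). For RDDs, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).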
You can choose different storage levels like MEMORY_ONLY, DISK_ONLY, or MEMORY_AND_DISK.
Examples
This caches the DataFrame at the default storage level (memory first, spilling to disk if needed) for faster reuse.
Apache Spark
df.cache()
This keeps the DataFrame only in memory, which is fast, but partitions that do not fit are not stored and are recomputed each time they are needed.
Apache Spark
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_ONLY)
This removes the cached or persisted data to free up resources.
Apache Spark
df.unpersist()
Sample Program
This program creates a DataFrame, caches it, counts rows to trigger caching, unpersists it so that a different storage level can be assigned, persists it again with MEMORY_ONLY storage, shows the data, and finally unpersists it to free memory.
Apache Spark
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName('CachingExample').getOrCreate()

# Create a simple DataFrame
data = [(1, 'apple'), (2, 'banana'), (3, 'cherry')]
df = spark.createDataFrame(data, ['id', 'fruit'])

# Cache the DataFrame (lazy: nothing is stored yet)
print('Caching DataFrame...')
df.cache()

# Perform an action to trigger caching
print('Count:', df.count())

# Release the cache first: persist() on an already cached
# DataFrame does not change its storage level
df.unpersist()

# Persist with MEMORY_ONLY
print('Persisting DataFrame with MEMORY_ONLY...')
df.persist(StorageLevel.MEMORY_ONLY)

# Perform another action
print('Show DataFrame:')
df.show()

# Unpersist to free cache
print('Unpersisting DataFrame...')
df.unpersist()

spark.stop()
Important Notes
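cache() and persist() are lazy: data is stored only when an action runs. count() touches every partition, while show() may materialize only the partitions it needs to display.
persist() lets you pick the storage level; cache() uses the default level.
unpersist() frees the memory and disk space used by cached data.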
Summary
Caching and persistence speed up repeated data use by storing data in memory or on disk.
Use cache() for the default storage level and persist() to choose a level explicitly.
Call unpersist() when the data is no longer needed, to free resources.