
Caching and persistence in Apache Spark

Introduction

Spark recomputes a DataFrame from its lineage every time an action runs. Caching and persistence keep the computed result in memory or on disk so later actions can reuse it instead of recomputing it from scratch.

When you run the same transformations multiple times and want to avoid repeating the work.
When your dataset fits in available memory and you want faster access on reuse.
When you want to avoid repeating slow data loading or expensive calculations.
When you need to keep intermediate results around during multi-step processing.
When you want to balance speed and storage by choosing between memory and disk.
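The recompute-vs-cache behavior above can be sketched in plain Python (a toy analogy, not Spark itself; the class and method names are made up for illustration): a lazily computed dataset is re-derived on every access unless a cache flag tells it to keep the last result.

```python
# Toy analogy of Spark's recompute-vs-cache behavior (plain Python, not Spark).
class LazyDataset:
    def __init__(self, compute_fn):
        self.compute_fn = compute_fn   # the "lineage": how to rebuild the data
        self.cached = False
        self._result = None
        self.compute_calls = 0         # counts how often we actually recompute

    def cache(self):
        self.cached = True             # like df.cache(): lazy, nothing stored yet
        return self

    def unpersist(self):
        self.cached = False            # like df.unpersist(): drop the stored data
        self._result = None
        return self

    def collect(self):
        # An "action": materializes the data, reusing the cache if enabled.
        if self.cached and self._result is not None:
            return self._result
        self.compute_calls += 1
        result = self.compute_fn()
        if self.cached:
            self._result = result
        return result

ds = LazyDataset(lambda: [x * x for x in range(5)])
ds.collect()             # computed (1st time)
ds.collect()             # computed again: nothing was cached
ds.cache()
ds.collect()             # computed once more, then stored
ds.collect()             # served from the cache, no recompute
print(ds.compute_calls)  # 3
```

Note how `cache()` alone stores nothing; the data is only kept after the next `collect()` materializes it, which mirrors Spark's lazy caching.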
Syntax
Apache Spark
dataframe.cache()
dataframe.persist(StorageLevel.MEMORY_AND_DISK)

# To remove cached data:
dataframe.unpersist()

For RDDs, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY); for DataFrames, cache() defaults to MEMORY_AND_DISK.

With persist() you can choose among storage levels such as MEMORY_ONLY, DISK_ONLY, and MEMORY_AND_DISK, as well as replicated variants like MEMORY_ONLY_2.
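The fallback behavior of a MEMORY_AND_DISK-style level can be sketched with another plain-Python toy (a simplification, not Spark's actual block manager; the partition sizes and the 100 MB budget are invented for illustration): partitions go to memory until a budget is exhausted, and the overflow spills to disk.

```python
# Toy sketch of MEMORY_AND_DISK-style placement (plain Python, not Spark).
# Partitions fill a fixed memory budget first; the overflow spills to disk.
def place_partitions(partition_sizes, memory_budget):
    memory, disk = [], []
    used = 0
    for i, size in enumerate(partition_sizes):
        if used + size <= memory_budget:
            memory.append(i)   # fits: keep this partition in memory
            used += size
        else:
            disk.append(i)     # does not fit: spill to disk
    return memory, disk

# Hypothetical partition sizes (MB) with a 100 MB memory budget.
mem, disk = place_partitions([40, 30, 50, 20], 100)
print(mem, disk)  # [0, 1, 3] [2]
```

Under MEMORY_ONLY, the partitions that did not fit would instead be dropped and recomputed from lineage on the next action; MEMORY_AND_DISK trades some read speed for avoiding that recomputation.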

Examples
This caches the DataFrame in memory for faster reuse.
Apache Spark
df.cache()
This keeps the DataFrame only in memory, which is fastest; partitions that do not fit are not stored and are recomputed from the lineage when needed.
Apache Spark
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_ONLY)
This removes the cached or persisted data to free up resources.
Apache Spark
df.unpersist()
Sample Program

This program creates a DataFrame, caches it, counts rows to trigger the cache, then calls persist() with MEMORY_ONLY (which Spark ignores with a warning, since the DataFrame is already cached), shows the data, and finally unpersists it to free memory.

Apache Spark
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName('CachingExample').getOrCreate()

# Create a simple DataFrame
data = [(1, 'apple'), (2, 'banana'), (3, 'cherry')]
df = spark.createDataFrame(data, ['id', 'fruit'])

# Cache the DataFrame
print('Caching DataFrame...')
df.cache()

# Perform an action to trigger caching
print('Count:', df.count())

# Try to persist with MEMORY_ONLY
# Note: the DataFrame is already cached, so Spark keeps the existing
# storage level and logs a warning; unpersist first to change levels.
print('Persisting DataFrame with MEMORY_ONLY...')
df.persist(StorageLevel.MEMORY_ONLY)

# Perform another action
print('Show DataFrame:')
df.show()

# Unpersist to free cache
print('Unpersisting DataFrame...')
df.unpersist()

spark.stop()
Important Notes

Caching is lazy: data is only stored once an action like count() or show() materializes it.

persist() lets you pick the storage level; cache() applies a sensible default.

Unpersist frees memory and disk space used by cached data.

Summary

Caching and persistence speed up repeated data use by storing data in memory or disk.

Use cache() for simple caching, persist() to choose storage.

Always unpersist when done to save resources.