How to Use Cache for Performance in PySpark
In PySpark, you can use cache() on a DataFrame or RDD to store it in memory, which speeds up repeated operations by avoiding recomputation. Use persist() with different storage levels for more control over caching behavior.

Syntax
The basic syntax to cache a DataFrame or RDD is simple:
- df.cache() - marks the DataFrame for in-memory storage using the default storage level; the data is actually cached when the first action runs.
- df.persist(storageLevel) - stores the DataFrame with a specified storage level such as memory, disk, or both.
- df.unpersist() - removes the cached data from memory.
Caching helps when you reuse the same DataFrame multiple times in your code.
```python
from pyspark import StorageLevel

df.cache()
df.persist(StorageLevel.MEMORY_AND_DISK)
df.unpersist()
```
Example
This example shows how caching improves performance by storing a DataFrame in memory before running multiple actions.
```python
from pyspark.sql import SparkSession
import time

spark = SparkSession.builder.appName('CacheExample').getOrCreate()

# Create a DataFrame
numbers = spark.range(1, 10000000)

# Cache the DataFrame (caching happens on the first action)
numbers.cache()

# First action triggers computation and caching
start = time.time()
count1 = numbers.count()
end = time.time()
print(f'First count took {end - start:.2f} seconds')

# Second action reuses the cached data, so it runs faster
start = time.time()
count2 = numbers.filter('id % 2 == 0').count()
end = time.time()
print(f'Second count took {end - start:.2f} seconds')

spark.stop()
```
Output
```
First count took X.XX seconds
Second count took Y.YY seconds
```
Exact times depend on your machine, but the second action is typically much faster because it reads from the cache instead of recomputing the range.
Common Pitfalls
Common mistakes when using cache in PySpark include:
- Not triggering an action after cache(), so the data is never actually cached (caching is lazy).
- Caching very large DataFrames without enough memory, causing spills to disk or evictions that slow performance.
- Forgetting to call unpersist() on cached data when it is no longer needed, wasting memory.

Always monitor your cluster memory and cache only data that is reused multiple times.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('CachePitfall').getOrCreate()

df = spark.range(1000000)

# Wrong: cache() is called, but without an action nothing is cached yet
df.cache()

# Right: trigger an action to materialize the cache
count = df.count()  # This triggers caching

spark.stop()
```
Quick Reference
| Method | Description |
|---|---|
| cache() | Store DataFrame/RDD in memory for faster reuse |
| persist(storageLevel) | Store with specified storage level (memory, disk, etc.) |
| unpersist() | Remove cached data from memory |
| StorageLevel.MEMORY_ONLY | Cache only in memory; partitions that do not fit are dropped and recomputed when needed |
| StorageLevel.MEMORY_AND_DISK | Cache in memory and spill to disk when memory runs out |
Key Takeaways
- Use cache() to store DataFrames or RDDs in memory for faster repeated access.
- Always trigger an action after cache() to materialize the cache.
- Use persist() with storage levels for more control over caching behavior.
- Unpersist cached data when it is no longer needed to free memory.
- Monitor memory usage to avoid performance degradation from excessive caching.