How to Use Cache for Performance in PySpark
In PySpark, you can use cache() on a DataFrame or RDD to store it in memory, which speeds up repeated operations by avoiding recomputation. Use persist() with different storage levels for more control over caching behavior.

Syntax
The basic syntax to cache a DataFrame or RDD is simple:
- df.cache() - marks the DataFrame for in-memory storage using the default storage level; the data is actually cached when the first action runs.
- df.persist(storageLevel) - stores the DataFrame with a specified storage level such as memory, disk, or both.
- df.unpersist() - removes the cached data from memory.
Caching helps when you reuse the same DataFrame multiple times in your code.
```python
from pyspark import StorageLevel

df.cache()
df.persist(StorageLevel.MEMORY_AND_DISK)
df.unpersist()
```
Example
This example shows how caching improves performance by storing a DataFrame in memory before running multiple actions.
```python
from pyspark.sql import SparkSession
import time

spark = SparkSession.builder.appName('CacheExample').getOrCreate()

# Create a DataFrame
numbers = spark.range(1, 10000000)

# Cache the DataFrame (caching happens on the first action)
numbers.cache()

# First action triggers computation and caching
start = time.time()
count1 = numbers.count()
end = time.time()
print(f'First count took {end - start:.2f} seconds')

# Second action reuses the cached data, so it runs faster
start = time.time()
count2 = numbers.filter('id % 2 == 0').count()
end = time.time()
print(f'Second count took {end - start:.2f} seconds')

spark.stop()
```
Output
```
First count took X.XX seconds
Second count took Y.YY seconds
```
Exact times depend on your machine, but the second action is typically much faster because it reads from the cache instead of recomputing the range.
Common Pitfalls
Common mistakes when using cache in PySpark include:
- Not triggering an action after cache(), so the data is never actually cached (caching is lazy).
- Caching very large DataFrames without enough memory, causing spills to disk or evictions that slow performance.
- Forgetting to call unpersist() on cached data when it is no longer needed, wasting memory.

Always monitor your cluster memory and cache only data that is reused multiple times.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('CachePitfall').getOrCreate()

df = spark.range(1000000)

# Wrong: cache() is called, but without an action nothing is cached yet
df.cache()

# Right: trigger an action to materialize the cache
count = df.count()  # This triggers caching

spark.stop()
```
Quick Reference
| Method | Description |
|---|---|
| cache() | Store DataFrame/RDD in memory for faster reuse |
| persist(storageLevel) | Store with specified storage level (memory, disk, etc.) |
| unpersist() | Remove cached data from memory |
| StorageLevel.MEMORY_ONLY | Cache only in memory; partitions that do not fit are dropped and recomputed when needed |
| StorageLevel.MEMORY_AND_DISK | Cache in memory and spill to disk when memory runs out |
Key Takeaways
- Use cache() to store DataFrames or RDDs in memory for faster repeated access.
- Always trigger an action after cache() to materialize the cache.
- Use persist() with storage levels for more control over caching behavior.
- Unpersist cached data when it is no longer needed to free memory.
- Monitor memory usage to avoid performance degradation from excessive caching.