Fix Out of Memory Error in PySpark: Simple Solutions
To fix OutOfMemoryError in PySpark, increase executor and driver memory using the spark.executor.memory and spark.driver.memory settings, and optimize your code by reducing data shuffles and caching only the data you need.

Why This Happens
Out of memory errors occur when Spark tries to process more data than the allocated memory can hold. This often happens with large datasets or inefficient operations that cause excessive data shuffling or caching.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('MemoryErrorExample').getOrCreate()

# Very large dataset
data = spark.range(1_000_000_000)

# Grouping a billion rows and collecting every result to the driver
# can exhaust the heap if memory is insufficient
count = data.groupBy('id').count().collect()
```
Output
```
java.lang.OutOfMemoryError: Java heap space
```
The Fix
Increase the memory allocated to executors and the driver so Spark has more room to work with, and optimize your code by reducing unnecessary shuffles and caching only when needed. Note that in client mode, spark.driver.memory must be set before the driver JVM starts (for example via spark-submit --driver-memory or spark-defaults.conf); setting it through .config() in code only takes effect when the session launches a new JVM.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('MemoryFixExample') \
    .config('spark.executor.memory', '4g') \
    .config('spark.driver.memory', '4g') \
    .getOrCreate()

data = spark.range(1_000_000_000)

# Repartition so each partition is small enough to fit in executor memory;
# cache only if the data will be reused
data_reduced = data.repartition(100)

# limit(10) keeps the collected result small on the driver
count = data_reduced.groupBy('id').count().limit(10).collect()
print(count)
```
Output
```
[Row(id=0, count=1), Row(id=1, count=1), Row(id=2, count=1), Row(id=3, count=1), Row(id=4, count=1), Row(id=5, count=1), Row(id=6, count=1), Row(id=7, count=1), Row(id=8, count=1), Row(id=9, count=1)]
```
Each count is 1 because range() produces unique ids; which ten rows appear is not guaranteed, since limit() after a groupBy does not define an ordering.
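When launching with spark-submit, the same memory settings can be passed at launch time, which is the reliable way to set driver memory in client mode. The script name and values below are placeholders, not part of the original example:

```
spark-submit \
  --driver-memory 4g \
  --executor-memory 4g \
  my_job.py
```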
Prevention
- Set appropriate memory sizes for executors and driver based on your cluster resources.
- Avoid caching large datasets unless necessary.
- Use repartition() or coalesce() to control the number and size of partitions.
- Filter data early to reduce the amount of data processed.
- Monitor Spark UI to identify memory bottlenecks.
Related Errors
Other common memory-related errors include:
- GC overhead limit exceeded: JVM spends too much time garbage collecting. Fix by increasing memory or optimizing code.
- Task not serializable: raised when Spark tries to ship a non-serializable object (such as an open connection or file handle) to executors; keep such objects out of closures or create them inside the task.
- Shuffle memory errors: Caused by large shuffle operations; fix by tuning shuffle partitions.
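For the shuffle-related errors above, the main knob is spark.sql.shuffle.partitions, which defaults to 200 for DataFrame aggregations and joins. A sketch of a spark-defaults.conf entry (the value shown is illustrative and should be tuned to your data volume):

```
# spark-defaults.conf
# More shuffle partitions => smaller partitions => less memory per task
spark.sql.shuffle.partitions   400
```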
Key Takeaways
- Increase executor and driver memory settings to handle large data.
- Optimize Spark jobs by reducing shuffles and caching only necessary data.
- Use repartitioning to control data distribution and memory usage.
- Filter data early to minimize memory load.
- Monitor the Spark UI to detect and fix memory bottlenecks.