Apache Spark · Concept · Beginner · 3 min read

What is Lazy Evaluation in Spark: Explained Simply

In Apache Spark, lazy evaluation means that Spark delays running computations until an action is called. Instead of executing each step immediately, Spark builds a plan of transformations and runs them all at once when needed, which saves time and resources.
⚙️

How It Works

Lazy evaluation in Spark works like planning a trip before you start traveling. Imagine you want to visit several places, but instead of going to each place immediately, you first write down all the stops you want to make. Only when your plan is complete do you start the journey, following the best route.

Similarly, Spark does not run each data operation as soon as you write it. Instead, it remembers all the steps (called transformations) you want to perform on your data. When you finally ask for a result (an action), Spark looks at the whole plan and figures out the best way to run all steps together efficiently.

This approach avoids unnecessary work, like visiting the same place twice, and helps Spark optimize the process to run faster and use less memory.

💻

Example

This example shows lazy evaluation by creating a Spark DataFrame, applying transformations, and only running the computation when an action is called.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('LazyEvalExample').getOrCreate()

# Create a DataFrame
data = [(1, 'apple'), (2, 'banana'), (3, 'cherry')]
df = spark.createDataFrame(data, ['id', 'fruit'])

# Transformation: filter rows where id > 1
filtered_df = df.filter(df.id > 1)

# Transformation: select only the 'fruit' column
selected_df = filtered_df.select('fruit')

# No computation has run yet because no action is called

# Action: collect the results to the driver
result = selected_df.collect()

# Print the result
print(result)

spark.stop()
```

Output

```
[Row(fruit='banana'), Row(fruit='cherry')]
```
🎯

When to Use

Lazy evaluation is useful whenever you work with large datasets in Spark. It helps by:

  • Reducing unnecessary computations by combining steps.
  • Optimizing the execution plan to run faster and use less memory.
  • Allowing you to build complex data processing pipelines without running intermediate steps.

For example, if you want to clean data, filter it, and then aggregate results, Spark waits until you ask for the final output before running all steps together. This is especially helpful in big data jobs where efficiency matters.

Key Points

  • Lazy evaluation means Spark delays running transformations until an action is called.
  • This allows Spark to optimize the entire data processing plan.
  • Transformations are like instructions; actions trigger execution.
  • It improves performance and resource use in big data processing.

Key Takeaways

  • Lazy evaluation delays execution until an action triggers it, saving resources.
  • Transformations build a plan but do not run immediately in Spark.
  • Actions like collect() or count() cause Spark to execute the plan.
  • This approach helps Spark optimize and speed up big data processing.