
Lazy Evaluation in Apache Spark

Introduction

Lazy evaluation means Spark waits to run your code until a result is actually needed. This saves time and compute, because Spark can plan and optimize the whole job before executing any of it.

Lazy evaluation is useful:

When you want to build a data processing plan without running it immediately.
When you want to combine many data steps before actually doing the work.
When you want Spark to optimize your data tasks for faster results.
When you want to avoid wasting resources on unnecessary calculations.
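Before looking at Spark's syntax, the idea itself can be seen in plain Python. This is only an analogy (generator expressions, not Spark): the generator describes the filtering work without doing it, like a transformation, and consuming the generator plays the role of an action.

```python
data = [29, 35, 23, 41]

# "Transformation": nothing is computed yet -- this only builds a recipe
adults = (age for age in data if age > 30)

# "Action": iterating the generator finally runs the filter
count = sum(1 for _ in adults)
print(count)  # 2 (ages 35 and 41)
```

Until the `sum(...)` line runs, no element of `data` has been tested against the condition; Spark applies the same principle at cluster scale.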
Syntax
Apache Spark
val data = spark.read.csv("file.csv")
val filtered = data.filter("age > 30")
// No action here, just transformations
val count = filtered.count() // Action triggers execution

Transformations like filter() are lazy and do not run immediately.

Actions like count() start the actual data processing.

Examples
This example shows how transformations are lazy and only run when an action like count() is called.
Apache Spark
val rdd = spark.sparkContext.textFile("data.txt")
val words = rdd.flatMap(line => line.split(" "))
// No job runs yet
val wordCount = words.count() // Action triggers job
Here, filter() is lazy. The show() action runs the query and displays results.
Apache Spark
val df = spark.read.json("people.json")
val adults = df.filter("age >= 18")
// No execution yet
adults.show() // Action triggers execution and shows data
Sample Program

This code creates a small table, filters for people older than 30 (a lazy transformation), then counts them (an action). Only the count() call triggers Spark to actually run the filter.

Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LazyEvalExample").getOrCreate()

# Load data
data = spark.createDataFrame([
    (1, "Alice", 29),
    (2, "Bob", 35),
    (3, "Cathy", 23)
], ["id", "name", "age"])

# Transformation (lazy)
adults = data.filter("age > 30")

# No computation yet

# Action triggers computation
count = adults.count()
print(f"Number of adults older than 30: {count}")

spark.stop()
Output
Number of adults older than 30: 1
Important Notes

Lazy evaluation helps Spark optimize by combining steps before running.

Actions include count(), collect(), show(), and others.

Transformations include filter(), map(), select(), etc.
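The transformation/action split above can be sketched in a few lines of plain Python. The LazyFrame class below is hypothetical, not Spark's real implementation: each transformation only records a step in a plan, and only the action replays the recorded steps over the data.

```python
class LazyFrame:
    """Toy sketch of lazy evaluation (not Spark's actual classes)."""

    def __init__(self, rows, ops=None):
        self.rows = rows
        self.ops = ops or []  # recorded transformations, not yet executed

    def filter(self, pred):
        # Transformation: return a new plan with one more step; no work done
        return LazyFrame(self.rows, self.ops + [("filter", pred)])

    def map(self, fn):
        # Transformation: also just recorded
        return LazyFrame(self.rows, self.ops + [("map", fn)])

    def collect(self):
        # Action: only now walk the data and apply every recorded step
        out = list(self.rows)
        for kind, f in self.ops:
            if kind == "filter":
                out = [r for r in out if f(r)]
            else:  # "map"
                out = [f(r) for r in out]
        return out

    def count(self):
        # Action: built on collect()
        return len(self.collect())


frame = LazyFrame([29, 35, 23, 41])
plan = frame.filter(lambda age: age > 30)  # lazy: nothing computed yet
print(plan.count())                        # action: runs the filter -> 2
```

Because the whole plan is known before collect() runs, a real engine like Spark can reorder and combine the recorded steps; that is exactly where its optimizations come from.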

Summary

Spark waits to run your data steps until an action is called.

This saves time and resources by optimizing the work.

Remember: transformations are lazy, actions trigger execution.