Transformation vs Action in Spark: Key Differences and Usage
Transformations are lazy operations that define a new dataset from an existing one without executing immediately, while actions trigger the execution of these transformations and return results or write data. Transformations build the processing plan, and actions run it to produce output.
Quick Comparison
This table summarizes the main differences between transformations and actions in Spark.
| Aspect | Transformation | Action |
|---|---|---|
| Execution | Lazy (not executed immediately) | Eager (triggers execution) |
| Purpose | Defines new RDD/DataFrame from existing one | Returns result or writes data |
| Return Type | New RDD/DataFrame (deferred) | Value or side effect (e.g., count, collect) |
| Effect on Data | No immediate change | Computes and materializes data |
| Examples | map(), filter(), select() | count(), collect(), saveAsTextFile() |
| Use in Workflow | Build processing pipeline | Run pipeline and get output |
Key Differences
Transformations in Spark are operations like map(), filter(), and select() that create a new dataset from an existing one. They are lazy, meaning Spark only records the steps but does not run them immediately. This helps optimize the overall data processing by combining steps before execution.
Actions such as count(), collect(), and saveAsTextFile() trigger Spark to execute all the recorded transformations. Actions produce a result, like a number or a collected list, or cause a side effect like saving data to storage.
In short, transformations build the plan, and actions run the plan. Without actions, Spark does not process data, which makes transformations efficient for chaining multiple steps without overhead.
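The record-now, run-later pattern can be illustrated in plain Python with generator expressions, which are lazy in the same way transformations are: nothing executes until the pipeline is consumed. This is only an analogy, not Spark code; the `track()` helper is invented here to make the moment of execution visible.

```python
# Plain-Python analogy for Spark's lazy evaluation (not Spark code).
calls = []

def track(x):
    calls.append(x)   # side effect so we can see when work actually happens
    return x * 2

# "Transformation": the generator only records the computation; track() is
# not called yet.
lazy = (track(x) for x in [1, 2, 3])
assert calls == []    # nothing has run so far

# "Action": consuming the generator triggers the whole chain at once.
result = list(lazy)
print(result)  # [2, 4, 6]
print(calls)   # [1, 2, 3]
```

As with Spark, the work happens only at the consuming call; chaining more lazy steps before it adds no cost until then.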
Code Comparison
Here is a PySpark example of a transformation that filters even numbers from a list.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TransformationExample").getOrCreate()
data = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6])

# Transformation: filter even numbers (lazy, no execution yet)
even_numbers = data.filter(lambda x: x % 2 == 0)

# No output yet because no action has been called
spark.stop()
```
Action Equivalent
Now we add an action to trigger execution and collect the filtered results.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ActionExample").getOrCreate()
data = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6])

# Transformation (lazy)
filtered = data.filter(lambda x: x % 2 == 0)

# Action: collect() triggers execution and returns the result
result = filtered.collect()
print(result)  # Output: [2, 4, 6]

spark.stop()
```
When to Use Which
Choose transformations when you want to build or modify your data processing steps without running them immediately. This lets Spark optimize the entire workflow before execution.
Choose actions when you need to get results, save data, or trigger the actual computation. Without actions, Spark will not process the data.
In practice, use transformations to prepare your data and actions to finalize and retrieve outputs.
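The overall model, record transformations, replay them on an action, fits in a few lines of code. Below is a minimal, hypothetical sketch of that idea; the `LazyDataset` class and its methods are invented for illustration and are not Spark's actual implementation.

```python
# Hypothetical sketch of deferred execution; not Spark's real internals.
class LazyDataset:
    def __init__(self, data, steps=None):
        self.data = data
        self.steps = steps or []          # recorded transformations

    def filter(self, fn):
        # Transformation: record the step, return a new lazy dataset.
        return LazyDataset(self.data, self.steps + [("filter", fn)])

    def map(self, fn):
        # Transformation: also just recorded, never run here.
        return LazyDataset(self.data, self.steps + [("map", fn)])

    def collect(self):
        # Action: replay every recorded step, then return the result.
        out = list(self.data)
        for kind, fn in self.steps:
            if kind == "filter":
                out = [x for x in out if fn(x)]
            else:  # map
                out = [fn(x) for x in out]
        return out

ds = LazyDataset([1, 2, 3, 4, 5, 6])
pipeline = ds.filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
print(pipeline.collect())  # [20, 40, 60]
```

Keeping the recorded steps separate from execution is what gives a real engine like Spark the chance to inspect and optimize the whole plan before running it.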