Transformation vs Action in Spark: Key Differences and Usage
Transformations are lazy operations that define a new dataset from an existing one without executing immediately, while actions trigger the execution of these transformations and return results or write data. Transformations build the processing plan, and actions run it to produce output.
Quick Comparison
This table summarizes the main differences between transformations and actions in Spark.
| Aspect | Transformation | Action |
|---|---|---|
| Execution | Lazy (not executed immediately) | Eager (triggers execution) |
| Purpose | Defines new RDD/DataFrame from existing one | Returns result or writes data |
| Return Type | New RDD/DataFrame (deferred) | Value or side effect (e.g., count, collect) |
| Effect on Data | No immediate change | Computes and materializes data |
| Examples | map(), filter(), select() | count(), collect(), saveAsTextFile() |
| Use in Workflow | Build processing pipeline | Run pipeline and get output |
Key Differences
Transformations in Spark are operations like map(), filter(), and select() that create a new dataset from an existing one. They are lazy, meaning Spark only records the steps but does not run them immediately. This helps optimize the overall data processing by combining steps before execution.
Actions such as count(), collect(), and saveAsTextFile() trigger Spark to execute all the recorded transformations. Actions produce a result, like a number or a collected list, or cause a side effect like saving data to storage.
In short, transformations build the plan, and actions run the plan. Without actions, Spark does not process data, which makes transformations efficient for chaining multiple steps without overhead.
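The record-now, run-later pattern can be illustrated in plain Python with generator expressions, which are lazy in the same way transformations are: nothing executes until the pipeline is consumed. This is only an analogy, not Spark code; the `track()` helper is invented here to make the moment of execution visible.

```python
# Plain-Python analogy for Spark's lazy evaluation (not Spark code).
calls = []

def track(x):
    calls.append(x)   # side effect so we can see when work actually happens
    return x * 2

# "Transformation": the generator only records the computation; track() is
# not called yet.
lazy = (track(x) for x in [1, 2, 3])
assert calls == []    # nothing has run so far

# "Action": consuming the generator triggers the whole chain at once.
result = list(lazy)
print(result)  # [2, 4, 6]
print(calls)   # [1, 2, 3]
```

As with Spark, the work happens only at the consuming call; chaining more lazy steps before it adds no cost until then.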
Code Comparison
Here is a PySpark example of a transformation that filters even numbers from a list.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TransformationExample").getOrCreate()
data = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6])

# Transformation: filter even numbers (lazy, no execution yet)
even_numbers = data.filter(lambda x: x % 2 == 0)

# No output yet because no action has been called
spark.stop()
```
Action Equivalent
Now we add an action to trigger execution and collect the filtered results.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ActionExample").getOrCreate()
data = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6])

# Transformation (lazy)
filtered = data.filter(lambda x: x % 2 == 0)

# Action: collect() triggers execution and returns the result
result = filtered.collect()
print(result)  # Output: [2, 4, 6]

spark.stop()
```
When to Use Which
Choose transformations when you want to build or modify your data processing steps without running them immediately. This lets Spark optimize the entire workflow before execution.
Choose actions when you need to get results, save data, or trigger the actual computation. Without actions, Spark will not process the data.
In practice, use transformations to prepare your data and actions to finalize and retrieve outputs.
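The overall model, record transformations, replay them on an action, fits in a few lines of code. Below is a minimal, hypothetical sketch of that idea; the `LazyDataset` class and its methods are invented for illustration and are not Spark's actual implementation.

```python
# Hypothetical sketch of deferred execution; not Spark's real internals.
class LazyDataset:
    def __init__(self, data, steps=None):
        self.data = data
        self.steps = steps or []          # recorded transformations

    def filter(self, fn):
        # Transformation: record the step, return a new lazy dataset.
        return LazyDataset(self.data, self.steps + [("filter", fn)])

    def map(self, fn):
        # Transformation: also just recorded, never run here.
        return LazyDataset(self.data, self.steps + [("map", fn)])

    def collect(self):
        # Action: replay every recorded step, then return the result.
        out = list(self.data)
        for kind, fn in self.steps:
            if kind == "filter":
                out = [x for x in out if fn(x)]
            else:  # map
                out = [fn(x) for x in out]
        return out

ds = LazyDataset([1, 2, 3, 4, 5, 6])
pipeline = ds.filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
print(pipeline.collect())  # [20, 40, 60]
```

Keeping the recorded steps separate from execution is what gives a real engine like Spark the chance to inspect and optimize the whole plan before running it.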