Transformations and actions help us work with big data in Spark. Transformations prepare data, and actions get results.
Transformations vs actions in Apache Spark
Introduction
Use a transformation when you want to change or filter data without running the job immediately.
Use an action when you want to count or collect data to see the final output.
Use several transformations in a row when you want to chain multiple data steps before getting results.
Use an action when you want to save processed data to a file or database.
Use an action when you want to check whether data meets a condition by running a quick test.
Syntax
Apache Spark
rdd_transformed = rdd_original.transformation()
result = rdd_transformed.action()
Transformations create a new dataset from an existing one but do not run immediately.
Actions trigger the execution and return results or write data.
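The difference between the two can be pictured with a small plain-Python analogy. This is only a sketch and not how Spark is implemented: LazyDataset and its methods are hypothetical names made up for the illustration. Each transformation only records a step, and the action is what finally runs the recorded steps.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not part of Spark)."""

    def __init__(self, items, steps=()):
        self._items = list(items)
        self._steps = tuple(steps)  # recorded steps, not yet executed

    # Transformations: return a new dataset with one more recorded step.
    def map(self, fn):
        return LazyDataset(self._items, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._items, self._steps + (("filter", pred),))

    # Action: run every recorded step and return a concrete result.
    def collect(self):
        data = self._items
        for kind, fn in self._steps:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return list(data)


calls = []  # records every time the mapping function actually runs


def double(x):
    calls.append(x)
    return x * 2


dataset = LazyDataset([1, 2, 3]).map(double)  # transformation: nothing runs yet
calls_before_action = len(calls)              # the function has not been called
collected = dataset.collect()                 # action: the recorded step runs now
calls_after_action = len(calls)               # the function ran once per item
```

The mapping function is not called when map() is applied, only when collect() asks for a result. Real Spark follows the same idea but plans and distributes the work across a cluster.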
Examples
filter is a transformation that keeps only the items that satisfy a condition. count is an action that returns the number of items.
Apache Spark
rdd_filtered = rdd.filter(lambda x: x > 10)  # Transformation
count = rdd_filtered.count()  # Action
map applies a function to each item. collect brings all data back to the driver as a list, so use it only when the result is small enough to fit in the driver's memory.
Apache Spark
rdd_mapped = rdd.map(lambda x: x * 2)  # Transformation
collected = rdd_mapped.collect()  # Action
Sample Program
This code creates a list of numbers, keeps only the even ones, multiplies them by 10, and then collects the results and prints them.
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('TransformationsVsActions').getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6])

# Transformation: filter even numbers
rdd_even = rdd.filter(lambda x: x % 2 == 0)

# Transformation: multiply each by 10
rdd_times_ten = rdd_even.map(lambda x: x * 10)

# Action: collect results
result = rdd_times_ten.collect()
print(result)

spark.stop()
Output
[20, 40, 60]
Important Notes
Transformations are lazy and only run when an action is called.
Actions return results to the driver or write data out.
Because nothing runs until an action is called, knowing which calls are actions helps you optimize Spark jobs and avoid triggering unnecessary work.
Summary
Transformations prepare or change data but do not run immediately.
Actions run the job and return results or save data.
Use transformations to build your data steps, then use actions to get output.