Transformations vs actions in Apache Spark

Introduction

Transformations and actions are the two kinds of operations Spark provides for working with large datasets. Transformations describe how to prepare the data, and actions run the job to produce results.

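Typical use cases:
When you want to change or filter data without running the job immediately (a transformation).
When you want to count or collect data to see the final output (an action).
When you want to chain multiple data steps before getting results (several transformations followed by one action).
When you want to save processed data to a file or database (an action).
When you want to check whether data meets a condition by running a quick test (a transformation such as filter followed by an action such as count).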
Syntax
Apache Spark
rdd_transformed = rdd_original.transformation()
result = rdd_transformed.action()

Transformations create a new dataset from an existing one but do not run immediately.

Actions trigger the execution and return results or write data.
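The "recorded now, run later" behavior can be observed without a cluster. The sketch below is only a rough analogy, not Spark code: it relies on the fact that Python's built-in filter() and map() are also lazy, and it uses a hypothetical helper (is_large) with a side effect to show when work actually happens.

```python
# Rough analogy only: Python's built-in filter() and map() return lazy
# iterators, loosely mirroring how Spark defers transformations.
seen = []  # records which items were actually processed


def is_large(x):
    seen.append(x)  # side effect so we can observe when the work happens
    return x > 10


numbers = [5, 12, 20]

# "Transformations": the pipeline is described, but nothing is evaluated yet
pipeline = map(lambda x: x * 2, filter(is_large, numbers))
seen_before = list(seen)

# "Action": asking for a concrete list forces the whole pipeline to run
result = list(pipeline)
seen_after = list(seen)

print(seen_before)  # []
print(result)       # [24, 40]
print(seen_after)   # [5, 12, 20]
```

The empty list printed first is the key point: building the pipeline did no work, just as defining an RDD transformation does not start a Spark job.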

Examples
Filter is a transformation that selects data. Count is an action that returns the number of items.
Apache Spark
rdd_filtered = rdd.filter(lambda x: x > 10)  # Transformation
count = rdd_filtered.count()  # Action
Map changes each item. Collect brings all data to the driver as a list, so use it only when the result fits in the driver's memory.
Apache Spark
rdd_mapped = rdd.map(lambda x: x * 2)  # Transformation
collected = rdd_mapped.collect()  # Action
Sample Program

This code creates a list of numbers, keeps only the even ones, multiplies each of them by 10, and then collects the results and prints them.

Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('TransformationsVsActions').getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6])

# Transformation: filter even numbers
rdd_even = rdd.filter(lambda x: x % 2 == 0)

# Transformation: multiply each by 10
rdd_times_ten = rdd_even.map(lambda x: x * 10)

# Action: collect results
result = rdd_times_ten.collect()

print(result)

spark.stop()
Output
[20, 40, 60]
Important Notes

Transformations are lazy and only run when an action is called.

Actions return results to the driver or write data out.

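Because nothing runs until an action is called, Spark can plan the whole chain of transformations at once. Understanding this helps you optimize Spark jobs and avoid unnecessary work.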

Summary

Transformations prepare or change data but do not run immediately.

Actions run the job and return results or save data.

Use transformations to build your data steps, then use actions to get output.