What is Transformation in Spark: Definition and Examples
A transformation is an operation that creates a new dataset from an existing one without immediately executing it. Transformations are lazy: Spark builds a plan of operations and runs it only when an action is called.
How It Works
Think of transformations in Spark like writing down a recipe instead of cooking the meal right away. When you apply a transformation, Spark just notes what changes to make to the data but does not process it immediately. This is called lazy evaluation.
Later, when you ask Spark to produce a result (an action), it follows the recipe and performs all the transformations in one go. This approach saves time and resources by optimizing the steps before running them.
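The same idea can be sketched in plain Python using a generator, which also records work without doing it. This is only an analogy for lazy evaluation, not Spark code:

```python
# A plain-Python analogy for Spark's lazy evaluation (not Spark itself):
# a generator describes the work but does nothing until results are requested.

def build_pipeline(data):
    # Like applying transformations: nothing runs when this is created.
    return (x * 2 for x in data if x > 5)

numbers = [1, 3, 5, 7, 9]
pipeline = build_pipeline(numbers)  # the "recipe" is recorded, no computation yet

# Like calling an action: iterating forces the whole pipeline to execute.
result = list(pipeline)
print(result)  # [14, 18]
```

Until `list(pipeline)` is called, no element is doubled or filtered, just as Spark performs no work until an action such as `collect` runs.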
Examples of transformations include map, filter, and flatMap, which create new datasets by changing or filtering the original data.
Example
This example shows how to use a transformation to filter numbers greater than 5 from a list in Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('TransformationExample').getOrCreate()

# Create an RDD from a list
numbers = spark.sparkContext.parallelize([1, 3, 5, 7, 9])

# Apply a transformation to filter numbers greater than 5
filtered_numbers = numbers.filter(lambda x: x > 5)

# Action to collect and print the results
result = filtered_numbers.collect()
print(result)  # [7, 9]

spark.stop()
When to Use
Use transformations when you want to prepare or change your data step-by-step without running the process immediately. This is useful when working with large datasets because Spark can optimize all transformations together before running them.
Real-world uses include cleaning data by removing unwanted records, converting data formats, or extracting specific information before analysis or machine learning.
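A cleaning workflow like this usually chains a filter step and a map step. The sketch below mirrors that RDD-style chain in plain Python; the records and fields are made up for illustration:

```python
# A minimal data-cleaning sketch mirroring a Spark filter-then-map chain.
# The raw lines and the name/age fields are hypothetical example data.

records = ["alice,30", "bob,", "carol,25", ""]  # raw CSV-style lines

# Step 1 (filter): drop empty or malformed lines
valid = [r for r in records if r and not r.endswith(",")]

# Step 2 (map): convert each remaining line to a (name, age) pair
parsed = [(name, int(age)) for name, age in (r.split(",") for r in valid)]

print(parsed)  # [('alice', 30), ('carol', 25)]
```

In Spark the same logic would be two lazy transformations (`filter` then `map`) that run together only when an action requests the cleaned data.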
Key Points
- Transformations create new datasets from existing ones without immediate execution.
- They are lazy and only run when an action is called.
- Common transformations include map, filter, and flatMap.
- They allow Spark to optimize data processing for better performance.