What is Transformation in Spark: Definition and Examples
A transformation is an operation that creates a new dataset from an existing one without immediately executing it. Transformations are lazy: Spark builds a plan of operations and runs it only when an action is called.
How It Works
Think of transformations in Spark like writing down a recipe instead of cooking the meal right away. When you apply a transformation, Spark just notes what changes to make to the data but does not process it immediately. This is called lazy evaluation.
Later, when you ask Spark to produce a result (an action), it follows the recipe and performs all the transformations in one go. This approach saves time and resources by optimizing the steps before running them.
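The same idea can be sketched in plain Python using a generator, which also records work without doing it. This is only an analogy for lazy evaluation, not Spark code:

```python
# A plain-Python analogy for Spark's lazy evaluation (not Spark itself):
# a generator describes the work but does nothing until results are requested.

def build_pipeline(data):
    # Like applying transformations: nothing runs when this is created.
    return (x * 2 for x in data if x > 5)

numbers = [1, 3, 5, 7, 9]
pipeline = build_pipeline(numbers)  # the "recipe" is recorded, no computation yet

# Like calling an action: iterating forces the whole pipeline to execute.
result = list(pipeline)
print(result)  # [14, 18]
```

Until `list(pipeline)` is called, no element is doubled or filtered, just as Spark performs no work until an action such as `collect` runs.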
Examples of transformations include map, filter, and flatMap, which create new datasets by changing or filtering the original data.
Example
This example shows how to use a transformation to filter numbers greater than 5 from a list in Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('TransformationExample').getOrCreate()

# Create an RDD from a list
numbers = spark.sparkContext.parallelize([1, 3, 5, 7, 9])

# Apply a transformation to filter numbers greater than 5
filtered_numbers = numbers.filter(lambda x: x > 5)

# Action to collect and print the results
result = filtered_numbers.collect()
print(result)  # [7, 9]

spark.stop()
When to Use
Use transformations when you want to prepare or change your data step-by-step without running the process immediately. This is useful when working with large datasets because Spark can optimize all transformations together before running them.
Real-world uses include cleaning data by removing unwanted records, converting data formats, or extracting specific information before analysis or machine learning.
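A cleaning workflow like this usually chains a filter step and a map step. The sketch below mirrors that RDD-style chain in plain Python; the records and fields are made up for illustration:

```python
# A minimal data-cleaning sketch mirroring a Spark filter-then-map chain.
# The raw lines and the name/age fields are hypothetical example data.

records = ["alice,30", "bob,", "carol,25", ""]  # raw CSV-style lines

# Step 1 (filter): drop empty or malformed lines
valid = [r for r in records if r and not r.endswith(",")]

# Step 2 (map): convert each remaining line to a (name, age) pair
parsed = [(name, int(age)) for name, age in (r.split(",") for r in valid)]

print(parsed)  # [('alice', 30), ('carol', 25)]
```

In Spark the same logic would be two lazy transformations (`filter` then `map`) that run together only when an action requests the cleaned data.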
Key Points
- Transformations create new datasets from existing ones without immediate execution.
- They are lazy and only run when an action is called.
- Common transformations include map, filter, and flatMap.
- They allow Spark to optimize data processing for better performance.