Apache Spark · data · ~10 mins

Transformations vs actions in Apache Spark - Visual Side-by-Side Comparison

Concept Flow - Transformations vs actions
Start with RDD/DataFrame → Apply Transformation → Lazy Evaluation (no execution yet) → Apply Action → Trigger Execution → Return Result / Output
Transformations create new datasets lazily; actions trigger computation and return results.
Execution Sample
Apache Spark
rdd = sc.parallelize([1, 2, 3, 4])   # create an RDD from a local list
mapped = rdd.map(lambda x: x * 2)    # transformation: builds a plan, nothing runs
count = mapped.count()               # action: triggers execution, returns 4
Create an RDD, transform it by doubling each element, then count elements (action triggers execution).
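The same lazy-then-eager pattern can be sketched in plain Python (an analogy only, not Spark itself): a generator does no work when defined, and work happens only when something consumes it, mirroring the count() call above. The `executed` list is an illustrative device for observing when elements are really processed.

```python
# Plain-Python analogy (not Spark): a generator is "lazy" like a
# Spark transformation -- defining it does no work.
executed = []

def doubled(values):
    for x in values:
        executed.append(x)   # side effect records real execution
        yield x * 2

mapped = doubled([1, 2, 3, 4])   # like rdd.map(...): nothing runs yet
assert executed == []            # no element processed so far

count = sum(1 for _ in mapped)   # like count(): consuming triggers the work
assert count == 4
assert executed == [1, 2, 3, 4]  # now every element was processed
```

Only when `count` is computed does the pipeline actually touch the data, just as Spark only executes the plan when an action runs.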
Execution Table
Step | Operation | Type | Execution Triggered? | Result/Output
1 | Create RDD with [1,2,3,4] | RDD creation (lazy) | No | [1,2,3,4] (lazy)
2 | Apply map(x*2) | Transformation | No | [2,4,6,8] (lazy)
3 | Call count() | Action | Yes | 4
💡 count() is an action: it triggers execution and returns the number of elements.
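A quick way to feel the difference from the table above, using plain Python builtins as a stand-in (an analogy, not Spark's API): Python's built-in `map()` is lazy like a transformation, while `list()`/`len()` materialize results like an action.

```python
# Plain-Python analogy: the builtin map() is lazy like a Spark
# transformation, while list()/len() act like actions.
mapped = map(lambda x: x * 2, [1, 2, 3, 4])
assert not isinstance(mapped, list)   # a lazy map object, not computed results

result = list(mapped)                 # "action": forces evaluation
assert result == [2, 4, 6, 8]
assert len(result) == 4               # the count from step 3
```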
Variable Tracker
Variable | Start | After Step 1 | After Step 2 | After Step 3
rdd | None | [1,2,3,4] | [1,2,3,4] | [1,2,3,4]
mapped | None | None | [2,4,6,8] | [2,4,6,8]
count | None | None | None | 4
(Values shown for rdd and mapped are their logical results; nothing is actually computed until the action in step 3.)
Key Moments - 3 Insights
Why doesn't the map transformation run immediately after it's called?
Because transformations are lazy in Spark, they only build a plan. Execution happens only when an action like count() is called (see execution_table step 2 vs 3).
What triggers the actual computation in Spark?
Actions trigger computation. In the example, count() is the action that causes Spark to process the data (execution_table step 3).
Does the variable 'mapped' hold the transformed data right after map()?
No, 'mapped' holds a reference to the transformation plan, not the computed data. Actual data is computed only after an action (variable_tracker shows 'mapped' after step 2 is lazy).
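This "plan, not data" idea can be demonstrated with a plain-Python generator expression (again an analogy, not Spark): the variable holds a recipe object, and the results exist only after it is consumed.

```python
# Plain-Python analogy (not Spark): a generator stores a recipe, not data.
import types

data = [1, 2, 3, 4]
mapped = (x * 2 for x in data)        # like mapped = rdd.map(...): a plan object
assert isinstance(mapped, types.GeneratorType)   # no computed results inside yet

materialized = list(mapped)           # like an action: compute on demand
assert materialized == [2, 4, 6, 8]
```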
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, at which step does Spark actually process the data?
A. Step 1
B. Step 3
C. Step 2
D. None of the above
💡 Hint
Check the 'Execution Triggered?' column in the execution_table.
According to the variable tracker, what is the value of 'count' after step 2?
A. None
B. 4
C. [2,4,6,8]
D. [1,2,3,4]
💡 Hint
Look at the 'count' row in variable_tracker after step 2.
If we replaced count() with map(lambda x: x+1), what would happen?
A. Data would be processed immediately
B. An error would occur
C. No execution would happen yet
D. The count would be returned
💡 Hint
Remember: only actions trigger execution; transformations like map() do not (see concept_flow).
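The third quiz question can be checked with a plain-Python sketch (an analogy, not Spark): chaining a second lazy map on top of the first still runs nothing, so the answer is that no execution happens yet. The `calls` list is an illustrative device for observing when elements are really processed.

```python
# Plain-Python analogy for quiz question 3 (not Spark itself):
# chaining a second lazy map still runs nothing.
calls = []

def traced(x):
    calls.append(x)   # records when an element is really processed
    return x + 1

step1 = map(lambda x: x * 2, [1, 2, 3, 4])   # like rdd.map(lambda x: x * 2)
step2 = map(traced, step1)                   # like mapped.map(lambda x: x + 1)
assert calls == []                           # no execution yet: both are transformations

result = list(step2)                         # an "action" finally runs the chain
assert result == [3, 5, 7, 9]
assert calls == [2, 4, 6, 8]                 # traced saw the doubled values
```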
Concept Snapshot
Transformations create new datasets lazily without running computations.
Actions trigger Spark to execute the transformations and return results.
Transformations are like a recipe; actions are when you cook.
Common actions: count(), collect(), take().
Common transformations: map(), filter(), flatMap().
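The common operations listed above have rough plain-Python counterparts (an analogy only; the real Spark calls run distributed over a cluster): `map()`/`filter()` are lazy like transformations, while `list()` and slicing force evaluation like collect() and take().

```python
# Plain-Python analogies (not Spark) for the common operations above:
# map()/filter() are lazy like transformations; list() and slicing
# force evaluation like the actions collect() and take().
nums = [1, 2, 3, 4, 5, 6]
evens = filter(lambda x: x % 2 == 0, nums)   # like rdd.filter(...): lazy
squared = map(lambda x: x * x, evens)        # like .map(...): still lazy
collected = list(squared)                    # like collect(): triggers work
assert collected == [4, 16, 36]
taken = collected[:2]                        # like take(2)
assert taken == [4, 16]
```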
Full Transcript
In Apache Spark, transformations and actions behave differently. Transformations like map() or filter() do not run immediately; they just build a plan to process data later. This is called lazy evaluation. Actions like count() or collect() trigger Spark to run all the transformations and produce results. For example, creating an RDD and applying map() does not process data yet. Only when count() is called does Spark execute the plan and return the number of elements. Variables holding transformations store plans, not actual data, until an action runs. Understanding this helps optimize Spark jobs and avoid unnecessary computations.