Apache Spark · data · ~10 mins

Transformations vs actions in Apache Spark - Visual Side-by-Side Comparison

Concept Flow - Transformations vs actions
Start with RDD/DataFrame → Apply Transformation → Lazy Evaluation (no execution yet) → Apply Action → Trigger Execution → Return Result / Output
Transformations create new datasets lazily; actions trigger computation and return results.
Execution Sample
Apache Spark
rdd = sc.parallelize([1, 2, 3, 4])   # create an RDD from a local list
mapped = rdd.map(lambda x: x * 2)    # transformation: builds a plan, nothing runs
count = mapped.count()               # action: triggers execution, returns 4
Create an RDD, transform it by doubling each element, then count elements (action triggers execution).
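The same lazy-then-eager pattern can be sketched in plain Python (an analogy only, not Spark itself): a generator does no work when defined, and work happens only when something consumes it, mirroring the count() call above. The `executed` list is an illustrative device for observing when elements are really processed.

```python
# Plain-Python analogy (not Spark): a generator is "lazy" like a
# Spark transformation -- defining it does no work.
executed = []

def doubled(values):
    for x in values:
        executed.append(x)   # side effect records real execution
        yield x * 2

mapped = doubled([1, 2, 3, 4])   # like rdd.map(...): nothing runs yet
assert executed == []            # no element processed so far

count = sum(1 for _ in mapped)   # like count(): consuming triggers the work
assert count == 4
assert executed == [1, 2, 3, 4]  # now every element was processed
```

Only when `count` is computed does the pipeline actually touch the data, just as Spark only executes the plan when an action runs.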
Execution Table
Step | Operation | Type | Execution Triggered? | Result/Output
1 | Create RDD with [1,2,3,4] | RDD creation (lazy) | No | [1,2,3,4] (lazy)
2 | Apply map(x*2) | Transformation | No | [2,4,6,8] (lazy)
3 | Call count() | Action | Yes | 4
💡 count() is an action: it triggers execution and returns the number of elements.
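A quick way to feel the difference from the table above, using plain Python builtins as a stand-in (an analogy, not Spark's API): Python's built-in `map()` is lazy like a transformation, while `list()`/`len()` materialize results like an action.

```python
# Plain-Python analogy: the builtin map() is lazy like a Spark
# transformation, while list()/len() act like actions.
mapped = map(lambda x: x * 2, [1, 2, 3, 4])
assert not isinstance(mapped, list)   # a lazy map object, not computed results

result = list(mapped)                 # "action": forces evaluation
assert result == [2, 4, 6, 8]
assert len(result) == 4               # the count from step 3
```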
Variable Tracker
Variable | Start | After Step 1 | After Step 2 | After Step 3
rdd | None | [1,2,3,4] | [1,2,3,4] | [1,2,3,4]
mapped | None | None | [2,4,6,8] | [2,4,6,8]
count | None | None | None | 4
(Values shown for rdd and mapped are their logical results; nothing is actually computed until the action in step 3.)
Key Moments - 3 Insights
Why doesn't the map transformation run immediately after it's called?
Because transformations are lazy in Spark, they only build a plan. Execution happens only when an action like count() is called (see execution_table step 2 vs 3).
What triggers the actual computation in Spark?
Actions trigger computation. In the example, count() is the action that causes Spark to process the data (execution_table step 3).
Does the variable 'mapped' hold the transformed data right after map()?
No, 'mapped' holds a reference to the transformation plan, not the computed data. Actual data is computed only after an action (variable_tracker shows 'mapped' after step 2 is lazy).
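This "plan, not data" idea can be demonstrated with a plain-Python generator expression (again an analogy, not Spark): the variable holds a recipe object, and the results exist only after it is consumed.

```python
# Plain-Python analogy (not Spark): a generator stores a recipe, not data.
import types

data = [1, 2, 3, 4]
mapped = (x * 2 for x in data)        # like mapped = rdd.map(...): a plan object
assert isinstance(mapped, types.GeneratorType)   # no computed results inside yet

materialized = list(mapped)           # like an action: compute on demand
assert materialized == [2, 4, 6, 8]
```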
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, at which step does Spark actually process the data?
A. Step 1
B. Step 3
C. Step 2
D. None of the above
💡 Hint
Check the 'Execution Triggered?' column in the execution_table.
According to the variable tracker, what is the value of 'count' after step 2?
A. None
B. 4
C. [2,4,6,8]
D. [1,2,3,4]
💡 Hint
Look at the 'count' row in variable_tracker after step 2.
If we replaced count() with map(lambda x: x+1), what would happen?
A. Data would be processed immediately
B. An error would occur
C. No execution would happen yet
D. The count would be returned
💡 Hint
Remember: only actions trigger execution; transformations like map() do not (see concept_flow).
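The third quiz question can be checked with a plain-Python sketch (an analogy, not Spark): chaining a second lazy map on top of the first still runs nothing, so the answer is that no execution happens yet. The `calls` list is an illustrative device for observing when elements are really processed.

```python
# Plain-Python analogy for quiz question 3 (not Spark itself):
# chaining a second lazy map still runs nothing.
calls = []

def traced(x):
    calls.append(x)   # records when an element is really processed
    return x + 1

step1 = map(lambda x: x * 2, [1, 2, 3, 4])   # like rdd.map(lambda x: x * 2)
step2 = map(traced, step1)                   # like mapped.map(lambda x: x + 1)
assert calls == []                           # no execution yet: both are transformations

result = list(step2)                         # an "action" finally runs the chain
assert result == [3, 5, 7, 9]
assert calls == [2, 4, 6, 8]                 # traced saw the doubled values
```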
Concept Snapshot
Transformations create new datasets lazily without running computations.
Actions trigger Spark to execute the transformations and return results.
Transformations are like a recipe; actions are when you cook.
Common actions: count(), collect(), take().
Common transformations: map(), filter(), flatMap().
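The common operations listed above have rough plain-Python counterparts (an analogy only; the real Spark calls run distributed over a cluster): `map()`/`filter()` are lazy like transformations, while `list()` and slicing force evaluation like collect() and take().

```python
# Plain-Python analogies (not Spark) for the common operations above:
# map()/filter() are lazy like transformations; list() and slicing
# force evaluation like the actions collect() and take().
nums = [1, 2, 3, 4, 5, 6]
evens = filter(lambda x: x % 2 == 0, nums)   # like rdd.filter(...): lazy
squared = map(lambda x: x * x, evens)        # like .map(...): still lazy
collected = list(squared)                    # like collect(): triggers work
assert collected == [4, 16, 36]
taken = collected[:2]                        # like take(2)
assert taken == [4, 16]
```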
Full Transcript
In Apache Spark, transformations and actions behave differently. Transformations like map() or filter() do not run immediately; they just build a plan to process data later. This is called lazy evaluation. Actions like count() or collect() trigger Spark to run all the transformations and produce results. For example, creating an RDD and applying map() does not process data yet. Only when count() is called does Spark execute the plan and return the number of elements. Variables holding transformations store plans, not actual data, until an action runs. Understanding this helps optimize Spark jobs and avoid unnecessary computations.