Concept Flow - Lazy evaluation in Spark
1. Define transformations
2. Build DAG (Directed Acyclic Graph)
3. Trigger action
4. Execute DAG
5. Return results
Spark waits to run computations until an action is called, building a plan (DAG) first, then executing it.
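
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()  # reuse the active context or start one

    rdd = sc.parallelize([1, 2, 3, 4])          # Step 1: source RDD, nothing runs
    mapped = rdd.map(lambda x: x * 2)           # Step 2: transformation, recorded only
    filtered = mapped.filter(lambda x: x > 4)   # Step 3: transformation, recorded only
    result = filtered.collect()                 # Step 4: action, runs the plan -> [6, 8]
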
| Step | Operation | Action Triggered? | DAG State | Execution | Output |
|---|---|---|---|---|---|
| 1 | Create RDD from list [1,2,3,4] | No | DAG with 1 node (source) | No execution | No output |
| 2 | Map: multiply each element by 2 | No | DAG with 2 nodes (source -> map) | No execution | No output |
| 3 | Filter: keep elements > 4 | No | DAG with 3 nodes (source -> map -> filter) | No execution | No output |
| 4 | Collect action called | Yes | DAG ready | Execute all transformations | [6, 8] |
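
The trace above can be imitated without a cluster. The following is a minimal pure-Python sketch, not Spark's implementation: `LazyDataset`, its `plan` attribute, and the `double` helper are hypothetical names made up for illustration. Each transformation only appends a step to a recorded plan, and `collect()` is the only method that touches the data.

```python
class LazyDataset:
    """Toy stand-in for an RDD: stores a plan, runs it only on collect()."""

    def __init__(self, source, plan=()):
        self.source = list(source)   # input data (the plan's source node)
        self.plan = tuple(plan)      # recorded transformations, in order

    def map(self, fn):
        # Transformation: extend the plan, compute nothing.
        return LazyDataset(self.source, self.plan + (("map", fn),))

    def filter(self, pred):
        # Transformation: extend the plan, compute nothing.
        return LazyDataset(self.source, self.plan + (("filter", pred),))

    def collect(self):
        # Action: run every recorded step now and return the results.
        data = list(self.source)
        for kind, fn in self.plan:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


calls = []  # records every time the map function actually runs


def double(x):
    calls.append(x)
    return x * 2


rdd = LazyDataset([1, 2, 3, 4])
mapped = rdd.map(double)
filtered = mapped.filter(lambda x: x > 4)
calls_before_action = list(calls)   # still empty: nothing has executed

result = filtered.collect()         # the action triggers the whole plan
print(calls_before_action, result, calls)
# prints: [] [6, 8] [1, 2, 3, 4]
```

The empty `calls_before_action` list corresponds to the "No execution" rows for steps 1 to 3, and the populated `calls` list after `collect()` corresponds to step 4.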

State of each variable after every step:

| Variable | Start | After Step 1 | After Step 2 | After Step 3 | After Step 4 |
|---|---|---|---|---|---|
| rdd | undefined | RDD with data [1, 2, 3, 4] | RDD with data [1, 2, 3, 4] | RDD with data [1, 2, 3, 4] | RDD with data [1, 2, 3, 4] |
| mapped | undefined | undefined | RDD with map transformation | RDD with map transformation | RDD with map transformation |
| filtered | undefined | undefined | undefined | RDD with filter transformation | RDD with filter transformation |
| result | undefined | undefined | undefined | undefined | [6, 8] |
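Lazy evaluation in Spark means that transformations only add to a plan (the DAG) and do not run immediately. An action such as collect() triggers execution of every transformation the result depends on. Because Spark sees the whole plan before running anything, it can optimize it, for example by pipelining map and filter into a single pass over the data instead of materializing each intermediate RDD. In short, transformations are lazy; actions are eager.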
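
One way to picture the optimization that deferred execution enables is pipelining, where each element flows through map and filter in one pass rather than each step producing a full intermediate collection. The sketch below is only an analogy built on plain Python generators, not Spark's scheduler; the function names and the `events` log are assumptions made up for illustration.

```python
events = []  # order in which operations touch each element


def source(values):
    for v in values:
        events.append(f"read {v}")
        yield v


def doubled(items):
    for x in items:
        events.append(f"map {x}")
        yield x * 2


def greater_than_four(items):
    for x in items:
        events.append(f"filter {x}")
        if x > 4:
            yield x


# Building the pipeline runs nothing: generators are lazy too.
pipeline = greater_than_four(doubled(source([1, 2, 3, 4])))
events_before = list(events)

# Consuming the pipeline plays the role of the action.
result = list(pipeline)

print(events_before)  # prints: []
print(result)         # prints: [6, 8]
print(events[:3])     # prints: ['read 1', 'map 1', 'filter 2']
```

The first three events show that element 1 is read, mapped, and filtered before element 2 is even read, so no intermediate list of doubled values is ever built.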