val rdd = sc.parallelize(Seq(1, 2, 3, 4))
val mapped = rdd.map(x => x * 2) // No action called yet
mapped.count()
Remember that transformations are lazy and actions trigger computation.
The map is a transformation and does not execute immediately. The count() is an action that triggers the computation and returns the number of elements, which is 4.
What is the result of the collect() action?
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))
val filtered = rdd.filter(_ % 2 == 0)
val mapped = filtered.map(_ * 10)
mapped.collect()
Filter keeps even numbers, then map multiplies by 10.
The filter keeps 2 and 4, then map multiplies each by 10, so collect() returns Array(20, 40).
val rdd = sc.parallelize(Seq(1, 2, 3))
rdd.map(_ * 2)
Think about what triggers Spark jobs to run.
Transformations like map are lazy and do not execute until an action is called.
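Since the snippet above never calls an action, Spark builds the plan but runs nothing. A minimal sketch of what actually triggers execution, assuming a live SparkContext sc:

```scala
val rdd = sc.parallelize(Seq(1, 2, 3))
val doubled = rdd.map(_ * 2) // lazy: only records the transformation in the lineage
doubled.collect()            // action: triggers a job, returns Array(2, 4, 6)
```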
Recall which operations cause Spark to run jobs.
Transformations build a computation plan and are lazy. Actions trigger execution and return results.
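The split can be seen in a short pipeline, assuming a live SparkContext sc: filter and map only extend the lineage, while each action launches its own job.

```scala
val rdd = sc.parallelize(Seq(1, 2, 3, 4))
val evens = rdd.filter(_ % 2 == 0) // transformation: lazy, nothing runs yet
val bigger = evens.map(_ + 10)     // transformation: lazy, extends the plan
bigger.count()   // action: runs a job, returns 2
bigger.collect() // action: runs another job, returns Array(12, 14)
```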
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))
val mapped = rdd.map(_ * 2)
val filtered = mapped.filter(_ > 5)
val countResult = filtered.count()
val collectResult = filtered.collect()
Think about how caching affects repeated actions on the same RDD.
Caching the filtered RDD stores its partitions in memory after the first action computes them, so the second action reads the cached data instead of recomputing the map and filter transformations. Note that both actions still run as separate Spark jobs; caching reduces the recomputation inside the second job, not the number of jobs.
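Applied to the snippet above, one call to cache() before the first action is enough; a sketch assuming a live SparkContext sc:

```scala
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))
val filtered = rdd.map(_ * 2).filter(_ > 5).cache() // mark for in-memory storage
filtered.count()   // first action: runs map and filter, populates the cache
filtered.collect() // second action: reads cached partitions, returns Array(6, 8, 10)
```

cache() is shorthand for persist(StorageLevel.MEMORY_ONLY); it is lazy itself and takes effect only when the first action materializes the RDD.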