Apache Spark · Data · ~10 mins

Map, filter, and flatMap operations in Apache Spark - Step-by-Step Execution

Concept Flow - Map, filter, and flatMap operations
RDD/DataFrame → map / filter / flatMap → Transformed RDD/DataFrame → Result
Start with data, then apply map to transform each item, filter to keep items by a condition, or flatMap to transform each item into a list and flatten; finally, collect the results.
Execution Sample
Apache Spark
# Assumes a running SparkContext, e.g. sc = SparkContext("local", "demo")
rdd = sc.parallelize([1, 2, 3, 4])                    # initial RDD
map_rdd = rdd.map(lambda x: x * 2)                    # double each element
filter_rdd = map_rdd.filter(lambda x: x > 4)          # keep elements > 4
flatmap_rdd = filter_rdd.flatMap(lambda x: [x, x+1])  # expand each x to x, x+1
result = flatmap_rdd.collect()                        # action: triggers execution
This code doubles numbers, keeps those greater than 4, then expands each to two numbers, and collects the final list.
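The same pipeline can be mirrored with plain Python list comprehensions, with no cluster needed. This is a sketch for intuition only: Spark's operators behave like these list operations, except that Spark evaluates them lazily and in parallel across partitions.

```python
data = [1, 2, 3, 4]
mapped = [x * 2 for x in data]                    # map: double each element
filtered = [x for x in mapped if x > 4]           # filter: keep elements > 4
flat = [y for x in filtered for y in (x, x + 1)]  # flatMap: expand and flatten
print(flat)  # [6, 7, 8, 9]
```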
Execution Table
Step | RDD Content | Operation | Transformation Detail | Resulting RDD Content
1 | [1, 2, 3, 4] | Start | Initial RDD | [1, 2, 3, 4]
2 | [1, 2, 3, 4] | map | Multiply each by 2 | [2, 4, 6, 8]
3 | [2, 4, 6, 8] | filter | Keep elements > 4 | [6, 8]
4 | [6, 8] | flatMap | For each x, create [x, x+1] | [6, 7, 8, 9]
5 | [6, 7, 8, 9] | collect | Gather all elements to driver | [6, 7, 8, 9]
💡 After collect, the data has been gathered on the driver and the pipeline is complete.
Variable Tracker
Variable | Start | After map | After filter | After flatMap | After collect
rdd | [1, 2, 3, 4] | [1, 2, 3, 4] | [1, 2, 3, 4] | [1, 2, 3, 4] | [1, 2, 3, 4]
map_rdd | — | [2, 4, 6, 8] | [2, 4, 6, 8] | [2, 4, 6, 8] | [2, 4, 6, 8]
filter_rdd | — | — | [6, 8] | [6, 8] | [6, 8]
flatmap_rdd | — | — | — | [6, 7, 8, 9] | [6, 7, 8, 9]
result | — | — | — | — | [6, 7, 8, 9]
Key Moments - 3 Insights
Why does flatMap produce more elements than filter or map?
Because flatMap maps each element to a list and then flattens all of those lists into one, the total number of elements can grow (see Execution Table, step 4).
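The difference is easy to see with plain Python lists: mapping a list-returning function keeps the nesting (one output per input), while flatMap-style flattening spreads the inner elements into a single sequence.

```python
data = [6, 8]
# map with a list-returning function: nesting preserved, 2 elements out
nested = [[x, x + 1] for x in data]           # [[6, 7], [8, 9]]
# flatMap-style: per-element lists flattened into one sequence, 4 elements out
flat = [y for x in data for y in [x, x + 1]]  # [6, 7, 8, 9]
```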
Does filter change the values of elements?
No. filter only keeps or removes elements based on a condition; it never changes their values (see Execution Table, step 3).
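A quick plain-Python check of this property: every element that survives a filter is exactly one of the inputs, just with the failing ones pruned.

```python
mapped = [2, 4, 6, 8]
filtered = [x for x in mapped if x > 4]
# Values are unchanged; filter only prunes, never transforms.
assert filtered == [6, 8]
assert all(x in mapped for x in filtered)
```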
When is the actual computation done in Spark for these operations?
Computation happens only when an action such as collect() is called; map, filter, and flatMap are lazy transformations that merely record the plan (see Execution Table, step 5).
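Python generator expressions give a rough analogy for this laziness (an analogy only; Spark's execution model is distributed and fault-tolerant): building the generator runs no user code, and work happens only when the result is consumed.

```python
log = []

def double(x):
    log.append(x)  # record when work actually happens
    return x * 2

pipeline = (double(x) for x in [1, 2, 3, 4])  # "transformation": nothing runs yet
assert log == []                              # no work done so far
result = list(pipeline)                       # "action": forces evaluation
assert log == [1, 2, 3, 4]
assert result == [2, 4, 6, 8]
```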
Visual Quiz - 3 Questions
Test your understanding
Look at the Execution Table: what is the content of filter_rdd after step 3?
A. [1, 2, 3, 4]
B. [2, 4, 6, 8]
C. [6, 8]
D. [6, 7, 8, 9]
💡 Hint
Check the 'Resulting RDD Content' column at step 3 in the Execution Table.
At which step does the RDD first contain more elements than in the previous step?
A. Step 4 (flatMap)
B. Step 2 (map)
C. Step 3 (filter)
D. Step 5 (collect)
💡 Hint
Look at the 'Resulting RDD Content' sizes across the Execution Table rows.
If the filter condition were changed to x > 7, what would filter_rdd contain after step 3?
A. [6, 8]
B. [8]
C. [2, 4, 6, 8]
D. []
💡 Hint
Filter keeps elements strictly greater than 7; check the values after map in Execution Table step 2.
Concept Snapshot
Map, filter, and flatMap are Spark transformations.
map transforms each element.
filter keeps elements that satisfy a condition.
flatMap transforms each element into a list and flattens the results.
Actions such as collect trigger execution.
Full Transcript
We start with an RDD of numbers. Map doubles each number. Filter keeps only numbers greater than 4. FlatMap takes each number and creates a list with the number and the next number, then flattens all lists into one RDD. Finally, collect gathers all elements to the driver program. Map changes values, filter removes some elements, flatMap can increase the number of elements. Actual computation happens only when collect is called.