Apache Spark · Data · ~10 mins

Map, filter, and flatMap operations in Apache Spark - Step-by-Step Execution

Concept Flow - Map, filter, and flatMap operations
RDD/DataFrame → map / filter / flatMap → Transformed RDD/DataFrame → Result
Start with data, then apply map to transform each item, filter to keep items by a condition, or flatMap to transform each item into a list and flatten; finally, collect the results.
Execution Sample
Apache Spark
# Assumes a running SparkContext, e.g. sc = SparkContext("local", "demo")
rdd = sc.parallelize([1, 2, 3, 4])                    # initial RDD
map_rdd = rdd.map(lambda x: x * 2)                    # double each element
filter_rdd = map_rdd.filter(lambda x: x > 4)          # keep elements > 4
flatmap_rdd = filter_rdd.flatMap(lambda x: [x, x+1])  # expand each x to x, x+1
result = flatmap_rdd.collect()                        # action: triggers execution
This code doubles numbers, keeps those greater than 4, then expands each to two numbers, and collects the final list.
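The same pipeline can be mirrored with plain Python list comprehensions, with no cluster needed. This is a sketch for intuition only: Spark's operators behave like these list operations, except that Spark evaluates them lazily and in parallel across partitions.

```python
data = [1, 2, 3, 4]
mapped = [x * 2 for x in data]                    # map: double each element
filtered = [x for x in mapped if x > 4]           # filter: keep elements > 4
flat = [y for x in filtered for y in (x, x + 1)]  # flatMap: expand and flatten
print(flat)  # [6, 7, 8, 9]
```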
Execution Table
Step | RDD Content | Operation | Transformation Detail | Resulting RDD Content
1 | [1, 2, 3, 4] | Start | Initial RDD | [1, 2, 3, 4]
2 | [1, 2, 3, 4] | map | Multiply each by 2 | [2, 4, 6, 8]
3 | [2, 4, 6, 8] | filter | Keep elements > 4 | [6, 8]
4 | [6, 8] | flatMap | For each x, create [x, x+1] | [6, 7, 8, 9]
5 | [6, 7, 8, 9] | collect | Gather all elements to driver | [6, 7, 8, 9]
💡 After collect, the data has been gathered on the driver and the pipeline is complete.
Variable Tracker
Variable | Start | After map | After filter | After flatMap | After collect
rdd | [1, 2, 3, 4] | [1, 2, 3, 4] | [1, 2, 3, 4] | [1, 2, 3, 4] | [1, 2, 3, 4]
map_rdd | — | [2, 4, 6, 8] | [2, 4, 6, 8] | [2, 4, 6, 8] | [2, 4, 6, 8]
filter_rdd | — | — | [6, 8] | [6, 8] | [6, 8]
flatmap_rdd | — | — | — | [6, 7, 8, 9] | [6, 7, 8, 9]
result | — | — | — | — | [6, 7, 8, 9]
Key Moments - 3 Insights
Why does flatMap produce more elements than filter or map?
Because flatMap maps each element to a list and then flattens all of those lists into one, the total number of elements can grow (see Execution Table, step 4).
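The difference is easy to see with plain Python lists: mapping a list-returning function keeps the nesting (one output per input), while flatMap-style flattening spreads the inner elements into a single sequence.

```python
data = [6, 8]
# map with a list-returning function: nesting preserved, 2 elements out
nested = [[x, x + 1] for x in data]           # [[6, 7], [8, 9]]
# flatMap-style: per-element lists flattened into one sequence, 4 elements out
flat = [y for x in data for y in [x, x + 1]]  # [6, 7, 8, 9]
```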
Does filter change the values of elements?
No. filter only keeps or removes elements based on a condition; it never changes their values (see Execution Table, step 3).
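A quick plain-Python check of this property: every element that survives a filter is exactly one of the inputs, just with the failing ones pruned.

```python
mapped = [2, 4, 6, 8]
filtered = [x for x in mapped if x > 4]
# Values are unchanged; filter only prunes, never transforms.
assert filtered == [6, 8]
assert all(x in mapped for x in filtered)
```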
When is the actual computation done in Spark for these operations?
Computation happens only when an action such as collect() is called; map, filter, and flatMap are lazy transformations that merely record the plan (see Execution Table, step 5).
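Python generator expressions give a rough analogy for this laziness (an analogy only; Spark's execution model is distributed and fault-tolerant): building the generator runs no user code, and work happens only when the result is consumed.

```python
log = []

def double(x):
    log.append(x)  # record when work actually happens
    return x * 2

pipeline = (double(x) for x in [1, 2, 3, 4])  # "transformation": nothing runs yet
assert log == []                              # no work done so far
result = list(pipeline)                       # "action": forces evaluation
assert log == [1, 2, 3, 4]
assert result == [2, 4, 6, 8]
```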
Visual Quiz - 3 Questions
Test your understanding
Look at the Execution Table: what is the content of filter_rdd after step 3?
A. [1, 2, 3, 4]
B. [2, 4, 6, 8]
C. [6, 8]
D. [6, 7, 8, 9]
💡 Hint
Check the 'Resulting RDD Content' column at step 3 in the Execution Table.
At which step does the RDD first contain more elements than in the previous step?
A. Step 4 (flatMap)
B. Step 2 (map)
C. Step 3 (filter)
D. Step 5 (collect)
💡 Hint
Look at the 'Resulting RDD Content' sizes across the Execution Table rows.
If the filter condition were changed to x > 7, what would filter_rdd contain after step 3?
A. [6, 8]
B. [8]
C. [2, 4, 6, 8]
D. []
💡 Hint
Filter keeps elements strictly greater than 7; check the values after map in Execution Table step 2.
Concept Snapshot
Map, filter, and flatMap are Spark transformations.
map transforms each element.
filter keeps elements that satisfy a condition.
flatMap transforms each element into a list and flattens the results.
Actions such as collect trigger execution.
Full Transcript
We start with an RDD of numbers. Map doubles each number. Filter keeps only numbers greater than 4. FlatMap takes each number and creates a list with the number and the next number, then flattens all lists into one RDD. Finally, collect gathers all elements to the driver program. Map changes values, filter removes some elements, flatMap can increase the number of elements. Actual computation happens only when collect is called.