Map, filter, and flatMap operations in Apache Spark

Introduction

map, filter, and flatMap are RDD transformations that let you change, select, or expand the items in a large dataset. They are useful in situations like these:

You want to change each item in a list to a new form.
You want to keep only items that meet a rule.
You want to turn each item into many items and combine them all.
You want to clean data by removing unwanted parts.
You want to prepare data for analysis by reshaping it.
Syntax
Apache Spark
rdd.map(function)
rdd.filter(function)
rdd.flatMap(function)

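map applies the function to every item and produces exactly one output per input.

filter keeps only the items for which the function returns true.

flatMap lets the function return zero or more items per input, then flattens all the results into a single RDD.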
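To make the three behaviors concrete without starting a Spark session, here is a minimal plain-Python sketch that models them on an ordinary list. The helper names local_map, local_filter, and local_flat_map are hypothetical stand-ins for illustration, not Spark APIs. Real RDD transformations are also lazy and distributed, which this sketch ignores.

```python
from itertools import chain

# Hypothetical helpers that mimic the RDD semantics on a plain list.
# They are NOT part of Spark; they only model what each operation does.

def local_map(func, items):
    """Like rdd.map: exactly one output per input."""
    return [func(x) for x in items]

def local_filter(pred, items):
    """Like rdd.filter: keep items for which pred returns a true value."""
    return [x for x in items if pred(x)]

def local_flat_map(func, items):
    """Like rdd.flatMap: func returns an iterable per input; results are flattened."""
    return list(chain.from_iterable(func(x) for x in items))

numbers = [1, 2, 3, 4]

doubled = local_map(lambda x: x * 2, numbers)              # [2, 4, 6, 8]
evens = local_filter(lambda x: x % 2 == 0, numbers)        # [2, 4]
expanded = local_flat_map(lambda x: [x, x * 10], numbers)  # [1, 10, 2, 20, 3, 30, 4, 40]

print(doubled)
print(evens)
print(expanded)
```

Note that the function passed to flatMap must return an iterable (here a list), while the functions passed to map and filter return a single value.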

Examples
Doubles each number in the RDD.
Apache Spark
rdd.map(lambda x: x * 2)
Keeps only even numbers.
Apache Spark
rdd.filter(lambda x: x % 2 == 0)
For each number, emits two numbers: the original and the original multiplied by ten.
Apache Spark
rdd.flatMap(lambda x: [x, x * 10])
Sample Program

This code shows how to use map, filter, and flatMap on a small list of numbers. It first multiplies each number by 3, then keeps only the results greater than 7, and finally emits two numbers for each remaining one: the number itself and the number plus one.

Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('MapFilterFlatMap').getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# Map: multiply each number by 3
mapped = rdd.map(lambda x: x * 3)

# Filter: keep numbers greater than 7
filtered = mapped.filter(lambda x: x > 7)

# FlatMap: for each number, create number and number+1
flat_mapped = filtered.flatMap(lambda x: [x, x + 1])

print('Original:', rdd.collect())
print('Mapped (x3):', mapped.collect())
print('Filtered (>7):', filtered.collect())
print('FlatMapped (number and number+1):', flat_mapped.collect())

spark.stop()
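Expected output
Original: [1, 2, 3, 4, 5]
Mapped (x3): [3, 6, 9, 12, 15]
Filtered (>7): [9, 12, 15]
FlatMapped (number and number+1): [9, 10, 12, 13, 15, 16]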
Important Notes

Remember that map keeps the same number of items.

filter can reduce the number of items.

flatMap can increase or decrease the number of items, depending on how many items the function returns for each input.
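The effect on item counts can be checked with a small plain-Python sketch. It uses list comprehensions and itertools.chain as stand-ins for the Spark operations, so it is an analogy under that assumption rather than Spark code; the sample lines and variable names are made up for illustration.

```python
from itertools import chain

# Made-up sample data: three lines, one of them empty.
lines = ["spark is fast", "", "map filter flatMap"]

# map analogue: one result per line, so the count stays at 3.
lengths = [len(line) for line in lines]

# filter analogue: empty lines are dropped, so the count can shrink.
non_empty = [line for line in lines if line]

# flatMap analogue (growing): each line yields zero or more words.
# The empty line yields nothing; the other two yield three words each.
words = list(chain.from_iterable(line.split() for line in lines))

# flatMap analogue (shrinking): return [w] to keep a word, [] to drop it.
long_words = list(chain.from_iterable([w] if len(w) > 4 else [] for w in words))

print(len(lines), len(lengths))     # 3 3
print(len(non_empty))               # 2
print(len(words), len(long_words))  # 6 3
```

Returning an empty list for some inputs is what lets flatMap produce fewer items than it received, which map cannot do.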

Summary

map changes each item one by one.

filter selects items based on a rule.

flatMap turns each item into zero or more items and flattens the result.