
How to Use flatMap on an RDD in PySpark

In PySpark, flatMap is an RDD transformation that applies a function to each element and flattens the results into a single RDD. It is useful when each input element maps to zero or more output elements, unlike map, which always produces exactly one output element per input element.

Syntax

flatMap takes a function that returns an iterable (such as a list) for each element in the RDD, then flattens all of these iterables into one new RDD.

  • rdd.flatMap(func): Applies func to each element.
  • func: A function that returns a list or iterable for each input element.
python
rdd.flatMap(func)

# where func is a function that returns a list or iterable for each element
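
To make the flattening step concrete, here is a minimal plain-Python sketch of what flatMap does conceptually. The helpers `local_flat_map` and `local_map` are hypothetical names, not part of PySpark; they only model the semantics on a local list and ignore partitioning and lazy evaluation.

```python
from itertools import chain

def local_flat_map(func, elements):
    # Hypothetical local stand-in for rdd.flatMap(func):
    # apply func to each element, then flatten the results by one level.
    return list(chain.from_iterable(map(func, elements)))

def local_map(func, elements):
    # Hypothetical local stand-in for rdd.map(func): one output per input.
    return list(map(func, elements))

def repeat(n):
    # Returns a list with n copies of n, so 0 produces an empty list.
    return [n] * n

numbers = [0, 1, 2, 3]
expanded = local_flat_map(repeat, numbers)
nested = local_map(repeat, numbers)

print(expanded)  # [1, 2, 2, 3, 3, 3]
print(nested)    # [[], [1], [2, 2], [3, 3, 3]]
```

The input 0 contributes nothing to the flattened result, which is the "zero or more output elements" case.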

Example

This example shows how to split sentences into words using flatMap. Each sentence is split into a list of words, and flatMap flattens all word lists into one RDD of words.

python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

sentences = ["hello world", "flatMap example", "spark rdd"]
rdd = sc.parallelize(sentences)

words_rdd = rdd.flatMap(lambda sentence: sentence.split(" "))

print(words_rdd.collect())
Output
['hello', 'world', 'flatMap', 'example', 'spark', 'rdd']

Common Pitfalls

A common mistake is using map instead of flatMap when the function returns a list. map creates an RDD of lists rather than flattening them.

The example below contrasts the two:

python
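from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(["hello world", "spark rdd"])

# map keeps one output per input, so each sentence becomes a nested list
mapped = rdd.map(lambda x: x.split(" "))
print(mapped.collect())  # Output: [['hello', 'world'], ['spark', 'rdd']]

# flatMap flattens the per-sentence lists into one RDD of words
flat_mapped = rdd.flatMap(lambda x: x.split(" "))
print(flat_mapped.collect())  # Output: ['hello', 'world', 'spark', 'rdd']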
Output
[['hello', 'world'], ['spark', 'rdd']]
['hello', 'world', 'spark', 'rdd']

Quick Reference

Method | Description | Returns
flatMap(func) | Applies func to each element and flattens the results | New RDD with flattened elements
map(func) | Applies func to each element without flattening | New RDD with elements as returned by func
func | Function returning a list or iterable for each element | Iterable (list, tuple, etc.)

Key Takeaways

  • Use flatMap when your function returns multiple items per input element and you want a flat RDD of those items.
  • flatMap flattens the results into a single RDD, unlike map, which keeps nested structures.
  • Always ensure the function passed to flatMap returns an iterable such as a list or tuple.
  • Common mistake: using map instead of flatMap leads to nested lists in the output.
  • flatMap is useful for splitting, expanding, or filtering elements in an RDD.
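
The points about returning an iterable and about filtering can be checked locally. The sketch below uses a hypothetical plain-Python helper, `local_flat_map` (not a PySpark API), that flattens one level the way flatMap does. It assumes flatMap treats whatever the function returns as an iterable, which is why a bare string ends up split into characters.

```python
from itertools import chain

def local_flat_map(func, elements):
    # Hypothetical local stand-in for rdd.flatMap(func):
    # apply func to each element, then flatten the results by one level.
    return list(chain.from_iterable(map(func, elements)))

lines = ["spark rdd", "", "flatMap"]

# Filtering: return an empty list to drop an element entirely.
words = local_flat_map(lambda line: line.split() if line else [], lines)
print(words)  # ['spark', 'rdd', 'flatMap']

# Pitfall: a string is itself an iterable of characters.
chars = local_flat_map(lambda line: line.upper(), ["ab", "c"])
print(chars)  # ['A', 'B', 'C']

# Fix: wrap a single value in a list so it stays whole.
whole = local_flat_map(lambda line: [line.upper()], ["ab", "c"])
print(whole)  # ['AB', 'C']
```

With a real RDD, the corresponding calls would be rdd.flatMap(...) followed by collect().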