How to Use flatMap in Spark RDD in PySpark
In PySpark, flatMap is a transformation on RDDs that applies a function to each element and flattens the results into a single RDD. It is useful when each input element maps to zero or more output elements, unlike map, which keeps a one-to-one mapping.
Syntax
The flatMap function takes a function as input that returns an iterable (like a list) for each element in the RDD. It then flattens all these iterables into one RDD.
- rdd.flatMap(func): Applies func to each element.
- func: A function that returns a list or iterable for each input element.
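To make the contract on func concrete, here is a minimal sketch in plain Python that needs no Spark session. flat_map_local is a hypothetical stand-in for rdd.flatMap(func).collect(), written on the assumption that flattening behaves like itertools.chain.from_iterable over the per-element results. The names as_list, as_tuple, and as_generator are illustrative and not part of PySpark.

```python
from itertools import chain

def flat_map_local(func, items):
    """Hypothetical local stand-in for rdd.flatMap(func).collect()."""
    return list(chain.from_iterable(func(x) for x in items))

def as_list(n):
    return [n, n * 10]

def as_tuple(n):
    return (n, n * 10)

def as_generator(n):
    # A generator is also an iterable, so it is flattened the same way
    yield n
    yield n * 10

data = [1, 2, 3]
results = [flat_map_local(f, data) for f in (as_list, as_tuple, as_generator)]
for result in results:
    print(result)  # [1, 10, 2, 20, 3, 30] each time
```

Any of the three functions could be passed to rdd.flatMap unchanged. A generator avoids building an intermediate list for each element.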
```python
rdd.flatMap(func)
# where func is a function that returns a list or iterable for each element
```
Example
This example shows how to split sentences into words using flatMap. Each sentence is split into a list of words, and flatMap flattens all word lists into one RDD of words.
```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

sentences = ["hello world", "flatMap example", "spark rdd"]
rdd = sc.parallelize(sentences)

# Split each sentence into words and flatten the per-sentence lists
words_rdd = rdd.flatMap(lambda sentence: sentence.split(" "))
print(words_rdd.collect())
```
Output
['hello', 'world', 'flatMap', 'example', 'spark', 'rdd']
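One detail of the example is worth checking in plain Python: split(" ") splits on every single space, so repeated or trailing spaces produce empty strings, and flatMap would carry those into the RDD as elements. A small sketch, where words_of is a hypothetical helper name and the input string is made up for illustration:

```python
sentence = "hello   world "

# Splitting on a literal space keeps empty strings between repeated spaces
by_space = sentence.split(" ")

def words_of(text):
    # split() with no argument splits on runs of whitespace and drops empties
    return text.split()

by_whitespace = words_of(sentence)

print(by_space)       # ['hello', '', '', 'world', '']
print(by_whitespace)  # ['hello', 'world']

# With an RDD, the same function could be passed directly:
# words_rdd = rdd.flatMap(words_of)
```

For clean, single-spaced input like the sentences above, both forms give the same words.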
Common Pitfalls
One common mistake is using map instead of flatMap when the function returns a list. map will create an RDD of lists, not flatten them.
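Wrong and right usage:
```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(["hello world", "spark rdd"])

# Wrong: map keeps one list per input element
mapped = rdd.map(lambda x: x.split(" "))
print(mapped.collect())  # Output: [['hello', 'world'], ['spark', 'rdd']]

# Right: flatMap flattens the lists into individual words
flat_mapped = rdd.flatMap(lambda x: x.split(" "))
print(flat_mapped.collect())  # Output: ['hello', 'world', 'spark', 'rdd']
```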
Output
[['hello', 'world'], ['spark', 'rdd']]
['hello', 'world', 'spark', 'rdd']
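A related pitfall, sketched here in plain Python: a string is itself an iterable of characters, so a func that returns a string gets flattened character by character. flat_map_local is a hypothetical stand-in for rdd.flatMap(func).collect(), assuming flattening works like itertools.chain.from_iterable.

```python
from itertools import chain

def flat_map_local(func, items):
    """Hypothetical local stand-in for rdd.flatMap(func).collect()."""
    return list(chain.from_iterable(func(x) for x in items))

words = ["ab", "cd"]

# func returns a string: each character becomes its own element
chars = flat_map_local(lambda w: w.upper(), words)

# func returns a one-element list: each word stays whole
whole = flat_map_local(lambda w: [w.upper()], words)

print(chars)  # ['A', 'B', 'C', 'D']
print(whole)  # ['AB', 'CD']
```

When every input yields exactly one output, map is usually the simpler choice.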
Quick Reference
| Name | Description | Returns |
|---|---|---|
| flatMap(func) | Applies func to each element and flattens the results | New RDD with flattened elements |
| map(func) | Applies func to each element without flattening | New RDD with elements as returned by func |
| func | Function returning a list or iterable for each element | Iterable (list, tuple, etc.) |
Key Takeaways
Use flatMap when your function returns multiple items per input element and you want a flat RDD of those items.
flatMap flattens the results into a single RDD, unlike map which keeps nested structures.
Always ensure the function passed to flatMap returns an iterable like a list or tuple.
Common mistake: using map instead of flatMap leads to nested lists in the output.
flatMap is useful for splitting, expanding, or filtering elements in an RDD.
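The last point can be sketched in plain Python: a single func can split, expand, and filter at once by yielding zero or more items per input. flat_map_local is a hypothetical stand-in for rdd.flatMap(func).collect(), assuming flattening works like itertools.chain.from_iterable. parse_ints and the sample lines are made up for illustration.

```python
from itertools import chain

def flat_map_local(func, items):
    """Hypothetical local stand-in for rdd.flatMap(func).collect()."""
    return list(chain.from_iterable(func(x) for x in items))

def parse_ints(line):
    # Split the line into tokens, keep only decimal tokens, convert to int.
    # A line with no valid tokens yields nothing, so it is filtered out.
    for token in line.split():
        if token.isdecimal():
            yield int(token)

lines = ["1 2 x", "", "30 y 4"]
numbers = flat_map_local(parse_ints, lines)
print(numbers)  # [1, 2, 30, 4]

# With an RDD, the same function could be passed directly:
# numbers_rdd = sc.parallelize(lines).flatMap(parse_ints)
```

The empty line contributes no elements, and the invalid tokens are dropped without a separate filter step.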