How to Use Filter in Spark RDD in PySpark: Simple Guide
In PySpark, you use filter() on an RDD to keep only the elements that satisfy a condition. You pass a function to filter() that returns True for elements to keep and False for those to remove.
Syntax
The filter() method is called on an RDD and takes one argument: a function that returns a boolean value. This function decides which elements to keep.
- rdd.filter(lambda x: condition): keeps elements where condition is True.
- The function can be a lambda or a named function.
```python
filtered_rdd = rdd.filter(lambda x: x > 10)
```
Example
This example creates an RDD of numbers and keeps only those greater than 10.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('FilterExample').getOrCreate()
rdd = spark.sparkContext.parallelize([5, 12, 7, 20, 3, 15])
filtered_rdd = rdd.filter(lambda x: x > 10)
result = filtered_rdd.collect()
print(result)
spark.stop()
```
Output
```
[12, 20, 15]
```
Common Pitfalls
Common mistakes when using filter() include:
- Not returning a boolean from the filter function.
- Using actions like collect() before filtering, which wastes resources.
- Forgetting that filter() is lazy and requires an action to execute.
```python
wrong_filter = rdd.filter(lambda x: x)         # Keeps truthy (non-zero) values, which can be confusing
correct_filter = rdd.filter(lambda x: x > 10)  # Clear condition returning a boolean
```
Quick Reference
| Method | Description | Example |
|---|---|---|
| filter(function) | Keeps elements where function returns True | rdd.filter(lambda x: x % 2 == 0) |
| collect() | Returns all elements to driver as list | rdd.collect() |
| count() | Returns number of elements | rdd.count() |
Key Takeaways
Use filter() with a function returning True to keep elements in an RDD.
filter() is lazy; use an action like collect() to see results.
Always ensure the filter function returns a boolean value.
Avoid calling actions before filtering to save computation.
You can use lambda or named functions with filter().