How to Use Filter in Spark RDD in PySpark: Simple Guide
In PySpark, you use filter() on an RDD to keep only the elements that satisfy a condition. You pass a function to filter() that returns True for elements to keep and False for those to remove.
Syntax
The filter() method is called on an RDD and takes one argument: a function that returns a boolean value. This function decides which elements to keep.
- rdd.filter(lambda x: condition): keeps elements where condition is True.
- The function can be a lambda or a named function.
```python
filtered_rdd = rdd.filter(lambda x: x > 10)
```
Example
This example creates an RDD of numbers and keeps only those greater than 10.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('FilterExample').getOrCreate()
rdd = spark.sparkContext.parallelize([5, 12, 7, 20, 3, 15])
filtered_rdd = rdd.filter(lambda x: x > 10)
result = filtered_rdd.collect()
print(result)
spark.stop()
```
Output
```
[12, 20, 15]
```
Common Pitfalls
Common mistakes when using filter() include:
- Not returning a boolean from the filter function.
- Using actions like collect() before filtering, which wastes resources.
- Forgetting that filter() is lazy and requires an action to execute.
```python
wrong_filter = rdd.filter(lambda x: x)         # Keeps truthy (non-zero) values, which can be confusing
correct_filter = rdd.filter(lambda x: x > 10)  # Clear condition returning a boolean
```
Quick Reference
| Method | Description | Example |
|---|---|---|
| filter(function) | Keeps elements where function returns True | rdd.filter(lambda x: x % 2 == 0) |
| collect() | Returns all elements to driver as list | rdd.collect() |
| count() | Returns number of elements | rdd.count() |
Key Takeaways
Use filter() with a function returning True to keep elements in an RDD.
filter() is lazy; use an action like collect() to see results.
Always ensure the filter function returns a boolean value.
Avoid calling actions before filtering to save computation.
You can use lambda or named functions with filter().