
How to Filter Rows in PySpark: Simple Syntax and Examples

In PySpark, you can filter rows of a DataFrame using the filter() or where() methods with a condition inside. These methods keep only the rows that match the condition you specify.

Syntax

Use filter(condition) or where(condition) on a PySpark DataFrame to keep rows that meet the condition. The condition is usually a comparison or logical expression on columns.

  • df.filter(df.column > value): keeps rows where column is greater than value.
  • df.where(df.column == 'text'): keeps rows where column equals 'text'.
```python
filtered_df = df.filter(df['age'] > 30)
filtered_df = df.where(df['name'] == 'Alice')
```

Example

This example creates a simple DataFrame and filters rows where the age is greater than 25. It shows how to use filter() and prints the filtered rows.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('FilterExample').getOrCreate()
data = [
    (1, 'Alice', 29),
    (2, 'Bob', 23),
    (3, 'Cathy', 31),
    (4, 'David', 19)
]
columns = ['id', 'name', 'age']
df = spark.createDataFrame(data, columns)

# Filter rows where age > 25
filtered_df = df.filter(df['age'] > 25)
filtered_df.show()

spark.stop()
```
Output

```
+---+-----+---+
| id| name|age|
+---+-----+---+
|  1|Alice| 29|
|  3|Cathy| 31|
+---+-----+---+
```

Common Pitfalls

One common mistake is using Python's `and`/`or` operators with column conditions. Python's `and` tries to convert a Column object to a boolean, which raises an error. Use PySpark's column operators instead: `==` for equality, `&` for AND, and `|` for OR, wrapping each condition in parentheses.

Wrong: df.filter(df['age'] > 25 and df['name'] == 'Alice') (this causes an error)

Right: df.filter((df['age'] > 25) & (df['name'] == 'Alice'))

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('PitfallExample').getOrCreate()
data = [(1, 'Alice', 29), (2, 'Bob', 23)]
columns = ['id', 'name', 'age']
df = spark.createDataFrame(data, columns)

# Wrong way (will raise error):
# filtered_df = df.filter(df['age'] > 25 and df['name'] == 'Alice')

# Right way:
filtered_df = df.filter((df['age'] > 25) & (df['name'] == 'Alice'))
filtered_df.show()

spark.stop()
```
Output

```
+---+-----+---+
| id| name|age|
+---+-----+---+
|  1|Alice| 29|
+---+-----+---+
```

Quick Reference

| Method | Description | Example |
|---|---|---|
| `filter(condition)` | Filters rows matching a condition | `df.filter(df['age'] > 30)` |
| `where(condition)` | Alias of `filter()` | `df.where(df['name'] == 'Bob')` |
| `&` (AND) | Logical AND for multiple conditions | `df.filter((df['age'] > 20) & (df['name'] != 'Alice'))` |
| `\|` (OR) | Logical OR for multiple conditions | `df.filter((df['age'] < 20) \| (df['name'] == 'Bob'))` |

Key Takeaways

  • Use df.filter(condition) or df.where(condition) to keep rows matching the condition.
  • Write conditions with PySpark column expressions, not plain Python operators.
  • Combine conditions with & (AND) and | (OR), wrapping each condition in parentheses.
  • filter() and where() are aliases and behave identically.
  • Test filter conditions on a small DataFrame first to catch expression errors early.