
How to Filter Rows in PySpark: Simple Syntax and Examples

In PySpark, you can filter rows of a DataFrame using the filter() or where() methods with a condition inside. These methods keep only the rows that match the condition you specify.

Syntax

Use filter(condition) or where(condition) on a PySpark DataFrame to keep rows that meet the condition. The condition is usually a comparison or logical expression on columns.

  • df.filter(df.column > value): keeps rows where column is greater than value.
  • df.where(df.column == 'text'): keeps rows where column equals 'text'.
```python
filtered_df = df.filter(df['age'] > 30)
filtered_df = df.where(df['name'] == 'Alice')
```

Example

This example creates a simple DataFrame and filters rows where the age is greater than 25. It shows how to use filter() and prints the filtered rows.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('FilterExample').getOrCreate()
data = [
    (1, 'Alice', 29),
    (2, 'Bob', 23),
    (3, 'Cathy', 31),
    (4, 'David', 19)
]
columns = ['id', 'name', 'age']
df = spark.createDataFrame(data, columns)

# Filter rows where age > 25
filtered_df = df.filter(df['age'] > 25)
filtered_df.show()

spark.stop()
```
Output

```
+---+-----+---+
| id| name|age|
+---+-----+---+
|  1|Alice| 29|
|  3|Cathy| 31|
+---+-----+---+
```

Common Pitfalls

One common mistake is using Python's `and`/`or` operators with column conditions. Python's `and` tries to convert a Column object to a boolean, which raises an error. Use PySpark's column operators instead: `==` for equality, `&` for AND, and `|` for OR, wrapping each condition in parentheses.

Wrong: df.filter(df['age'] > 25 and df['name'] == 'Alice') (this causes an error)

Right: df.filter((df['age'] > 25) & (df['name'] == 'Alice'))

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('PitfallExample').getOrCreate()
data = [(1, 'Alice', 29), (2, 'Bob', 23)]
columns = ['id', 'name', 'age']
df = spark.createDataFrame(data, columns)

# Wrong way (will raise error):
# filtered_df = df.filter(df['age'] > 25 and df['name'] == 'Alice')

# Right way:
filtered_df = df.filter((df['age'] > 25) & (df['name'] == 'Alice'))
filtered_df.show()

spark.stop()
```
Output

```
+---+-----+---+
| id| name|age|
+---+-----+---+
|  1|Alice| 29|
+---+-----+---+
```

Quick Reference

| Method | Description | Example |
|---|---|---|
| `filter(condition)` | Filters rows matching a condition | `df.filter(df['age'] > 30)` |
| `where(condition)` | Alias of `filter()` | `df.where(df['name'] == 'Bob')` |
| `&` (AND) | Logical AND for multiple conditions | `df.filter((df['age'] > 20) & (df['name'] != 'Alice'))` |
| `\|` (OR) | Logical OR for multiple conditions | `df.filter((df['age'] < 20) \| (df['name'] == 'Bob'))` |

Key Takeaways

  • Use df.filter(condition) or df.where(condition) to keep rows matching the condition.
  • Write conditions with PySpark column expressions, not plain Python operators.
  • Combine conditions with & (AND) and | (OR), wrapping each condition in parentheses.
  • filter() and where() are aliases and behave identically.
  • Test filter conditions on a small DataFrame first to catch expression errors early.