How to Filter Rows in PySpark: Simple Syntax and Examples
In PySpark, you can filter rows of a DataFrame using the
filter() or where() methods with a condition inside. These methods keep only the rows that match the condition you specify.
Syntax
Use filter(condition) or where(condition) on a PySpark DataFrame to keep rows that meet the condition. The condition is usually a comparison or logical expression on columns.
df.filter(df.column > value): keeps rows where column is greater than value.
df.where(df.column == 'text'): keeps rows where column equals 'text'.
```python
filtered_df = df.filter(df['age'] > 30)
filtered_df = df.where(df['name'] == 'Alice')
```
Example
This example creates a simple DataFrame and filters rows where the age is greater than 25. It shows how to use filter() and prints the filtered rows.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('FilterExample').getOrCreate()

data = [
    (1, 'Alice', 29),
    (2, 'Bob', 23),
    (3, 'Cathy', 31),
    (4, 'David', 19)
]
columns = ['id', 'name', 'age']
df = spark.createDataFrame(data, columns)

# Filter rows where age > 25
filtered_df = df.filter(df['age'] > 25)
filtered_df.show()

spark.stop()
```
Output
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1|Alice| 29|
| 3|Cathy| 31|
+---+-----+---+
Common Pitfalls
One common mistake is combining conditions with Python's and, or, or not keywords. These keywords cannot be overridden for column expressions, so PySpark raises an error when it tries to convert a Column to a Python boolean. Use the column operators instead: & for AND, | for OR, and ~ for NOT, and wrap each condition in parentheses because these operators bind more tightly than comparisons.
Wrong: df.filter(df['age'] > 25 and df['name'] == 'Alice') (raises an error)
Right: df.filter((df['age'] > 25) & (df['name'] == 'Alice'))
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('PitfallExample').getOrCreate()

data = [(1, 'Alice', 29), (2, 'Bob', 23)]
columns = ['id', 'name', 'age']
df = spark.createDataFrame(data, columns)

# Wrong way (raises an error):
# filtered_df = df.filter(df['age'] > 25 and df['name'] == 'Alice')

# Right way: use & and wrap each condition in parentheses
filtered_df = df.filter((df['age'] > 25) & (df['name'] == 'Alice'))
filtered_df.show()

spark.stop()
```
Output
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1|Alice| 29|
+---+-----+---+
Quick Reference
| Method | Description | Example |
|---|---|---|
| filter(condition) | Filters rows matching condition | df.filter(df['age'] > 30) |
| where(condition) | Same as filter, filters rows | df.where(df['name'] == 'Bob') |
| & (and) | Logical AND for multiple conditions | df.filter((df['age'] > 20) & (df['name'] != 'Alice')) |
| \| (or) | Logical OR for multiple conditions | df.filter((df['age'] < 20) \| (df['name'] == 'Bob')) |
Key Takeaways
Use df.filter(condition) or df.where(condition) to keep rows matching the condition.
Write conditions using PySpark column expressions, not Python operators.
Combine multiple conditions with & (and) and | (or), using parentheses.
where() is simply an alias for filter(); the two methods behave identically.
Always test your filter conditions to avoid syntax errors.