How to Use where in PySpark: Syntax and Examples
In PySpark, you use `where` to filter rows in a DataFrame based on a condition. It works like SQL's WHERE clause and accepts a SQL expression string or a Column condition, returning only the matching rows.
Syntax
The where method filters rows in a DataFrame based on a condition. You can pass a SQL expression as a string or a Column expression.
- `df.where(condition)`: Returns a new DataFrame with the rows that satisfy `condition`. `condition` can be a SQL expression string like `"age > 30"` or a Column expression like `df["age"] > 30`.
```python
filtered_df = df.where("age > 30")

# or, with a Column expression
from pyspark.sql.functions import col

filtered_df = df.where(col("age") > 30)
```
Example
This example shows how to create a PySpark DataFrame and use where to filter rows where the age is greater than 25.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("WhereExample").getOrCreate()

# Create sample data
data = [("Alice", 23), ("Bob", 30), ("Cathy", 27), ("David", 22)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)

# Filter rows where age > 25
filtered_df = df.where(col("age") > 25)
filtered_df.show()
```
Output
```
+-----+---+
| name|age|
+-----+---+
|  Bob| 30|
|Cathy| 27|
+-----+---+
```
Common Pitfalls
Common mistakes when using `where` include:
- Passing an invalid SQL expression string, which raises a parse error at runtime.
- Using Python boolean operators like `and`/`or` on Column expressions instead of `&` and `|` (with parentheses around each condition).
- Not importing `col` from `pyspark.sql.functions` when using Column expressions.
```python
from pyspark.sql.functions import col

# Wrong: Python 'and' does not work on Column objects
# filtered_df = df.where((col("age") > 25) and (col("age") < 30))  # raises an error

# Right: use '&' and parenthesize each condition
filtered_df = df.where((col("age") > 25) & (col("age") < 30))
filtered_df.show()
```
Output
```
+-----+---+
| name|age|
+-----+---+
|Cathy| 27|
+-----+---+
```
Quick Reference
Summary tips for using `where` in PySpark:
- Use `where` to filter DataFrame rows by conditions.
- Conditions can be SQL strings or Column expressions.
- For multiple conditions, combine with `&` (and) or `|` (or).
- Always import `col` for Column expressions.
Key Takeaways
- Use `df.where(condition)` to filter rows in a PySpark DataFrame.
- Conditions can be SQL strings or Column expressions using `col()`.
- Combine multiple conditions with `&` (and) or `|` (or), not Python `and`/`or`.
- Import `col` from `pyspark.sql.functions` when using Column expressions.
- `where` works like SQL's WHERE clause, selecting only the matching rows.