Apache Spark · How-To · Beginner · 3 min read

How to Use where in PySpark: Syntax and Examples

In PySpark, you use where to filter rows in a DataFrame based on a condition. It works like SQL's WHERE clause and accepts expressions or column conditions to return only matching rows.

Syntax

The where method filters rows in a DataFrame based on a condition. You can pass a SQL expression as a string or a Column expression.

  • df.where(condition): Returns a new DataFrame with rows that satisfy condition.
  • condition can be a string like "age > 30" or a Column expression like df["age"] > 30.
python
filtered_df = df.where("age > 30")
# or
from pyspark.sql.functions import col
filtered_df = df.where(col("age") > 30)

Example

This example shows how to create a PySpark DataFrame and use where to filter rows where the age is greater than 25.

python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("WhereExample").getOrCreate()

# Create sample data
data = [("Alice", 23), ("Bob", 30), ("Cathy", 27), ("David", 22)]
columns = ["name", "age"]

df = spark.createDataFrame(data, columns)

# Filter rows where age > 25
filtered_df = df.where(col("age") > 25)

filtered_df.show()
Output
+-----+---+
| name|age|
+-----+---+
|  Bob| 30|
|Cathy| 27|
+-----+---+

Common Pitfalls

Common mistakes when using where include:

  • Passing an invalid SQL expression string that causes errors.
  • Using Python boolean operators like and/or instead of PySpark's operators & and | (and forgetting the parentheses around each condition).
  • Not importing col from pyspark.sql.functions when using Column expressions.
python
from pyspark.sql.functions import col

# Wrong: Using Python 'and' instead of '&'
# filtered_df = df.where((col("age") > 25) and (col("age") < 30))  # This will error

# Right: Use '&' and parentheses
filtered_df = df.where((col("age") > 25) & (col("age") < 30))
filtered_df.show()
Output
+-----+---+
| name|age|
+-----+---+
|Cathy| 27|
+-----+---+
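The same range filter can also be written as a SQL expression string, where you use SQL's AND/OR keywords instead of & and |. A minimal, self-contained sketch (the app name is arbitrary; the sample data mirrors the example above):

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WhereSqlString").getOrCreate()

df = spark.createDataFrame(
    [("Alice", 23), ("Bob", 30), ("Cathy", 27), ("David", 22)],
    ["name", "age"],
)

# SQL expression strings use SQL syntax: AND / OR, not Python's and / or
filtered_df = df.where("age > 25 AND age < 30")
filtered_df.show()

Both forms produce the same result; the string form is convenient when the condition is already written in SQL, while Column expressions compose better programmatically.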

Quick Reference

Summary tips for using where in PySpark:

  • Use where to filter DataFrame rows by conditions.
  • Conditions can be SQL strings or Column expressions.
  • For multiple conditions, combine with & (and) or | (or) operators.
  • Always import col for Column expressions.
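To illustrate the | (or) operator from the tips above, here is a short self-contained sketch (app name and thresholds are arbitrary) that keeps rows matching either of two conditions:

python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("WhereOrExample").getOrCreate()

df = spark.createDataFrame(
    [("Alice", 23), ("Bob", 30), ("Cathy", 27), ("David", 22)],
    ["name", "age"],
)

# Combine conditions with | (or); each condition needs its own parentheses
filtered_df = df.where((col("age") < 23) | (col("age") > 29))
filtered_df.show()

This keeps David (under 23) and Bob (over 29), dropping the rows in between.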

Key Takeaways

  • Use df.where(condition) to filter rows in a PySpark DataFrame.
  • Conditions can be SQL strings or Column expressions using col().
  • Combine multiple conditions with & (and) or | (or), not Python and/or.
  • Import col from pyspark.sql.functions when using Column expressions.
  • where works like SQL's WHERE clause to select matching rows.