How to Use dropna in PySpark to Remove Null Values
In PySpark, use the dropna() method on a DataFrame to remove rows containing null or missing values. You can customize it by specifying parameters such as how (to drop rows with any or all nulls) and subset (to focus on specific columns).
Syntax
The dropna() method removes rows with null values from a DataFrame. It has these main parameters:
- how: 'any' (default) drops rows with any nulls; 'all' drops rows only if all specified columns are null.
- thresh: requires a minimum number of non-null values to keep the row.
- subset: list of columns to check for nulls instead of all columns.
```python
DataFrame.dropna(how='any', thresh=None, subset=None)
```
Example
This example shows how to remove rows with any null values and how to drop rows only if all values in specified columns are null.
```python
from pyspark.sql import SparkSession
from pyspark.sql import Row

spark = SparkSession.builder.master('local[*]').appName('DropNAExample').getOrCreate()

# Create example data with nulls
data = [
    Row(name='Alice', age=25, city='New York'),
    Row(name='Bob', age=None, city='Los Angeles'),
    Row(name=None, age=None, city=None),
    Row(name='David', age=30, city=None)
]
df = spark.createDataFrame(data)

# Drop rows with any null values
df_drop_any = df.dropna()

# Drop rows only where all columns are null
df_drop_all = df.dropna(how='all')

# Drop rows with a null in 'age' or 'city'
df_drop_subset = df.dropna(subset=['age', 'city'])

print('Original DataFrame:')
df.show()
print('Drop rows with any nulls:')
df_drop_any.show()
print('Drop rows where all columns are null:')
df_drop_all.show()
print('Drop rows with nulls in age or city:')
df_drop_subset.show()

spark.stop()
```
Output
Original DataFrame:
+-----+----+-----------+
| name| age|       city|
+-----+----+-----------+
|Alice|  25|   New York|
|  Bob|null|Los Angeles|
| null|null|       null|
|David|  30|       null|
+-----+----+-----------+
Drop rows with any nulls:
+-----+---+--------+
| name|age|    city|
+-----+---+--------+
|Alice| 25|New York|
+-----+---+--------+
Drop rows where all columns are null:
+-----+----+-----------+
| name| age|       city|
+-----+----+-----------+
|Alice|  25|   New York|
|  Bob|null|Los Angeles|
|David|  30|       null|
+-----+----+-----------+
Drop rows with nulls in age or city:
+-----+---+--------+
| name|age|    city|
+-----+---+--------+
|Alice| 25|New York|
+-----+---+--------+
Common Pitfalls
Common mistakes when using dropna() include:
- Not specifying subset when you want to check nulls only in certain columns, which may remove more rows than intended.
- Confusing how='any' and how='all', leading to unexpected row drops.
- Forgetting that dropna() returns a new DataFrame and does not modify the original.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('DropNAExample').getOrCreate()

data = [(1, None), (None, 2), (None, None), (3, 4)]
columns = ['A', 'B']
df = spark.createDataFrame(data, columns)

# Wrong: expecting to drop rows with nulls only in column 'A', but not using subset
wrong_drop = df.dropna()

# Right: specify subset to drop rows with nulls in column 'A' only
right_drop = df.dropna(subset=['A'])

print('Original DataFrame:')
df.show()
print('Wrong drop (no subset):')
wrong_drop.show()
print('Right drop (subset=["A"]):')
right_drop.show()

spark.stop()
```
Output
Original DataFrame:
+----+----+
|   A|   B|
+----+----+
|   1|null|
|null|   2|
|null|null|
|   3|   4|
+----+----+
Wrong drop (no subset):
+---+---+
|  A|  B|
+---+---+
|  3|  4|
+---+---+
Right drop (subset=["A"]):
+----+----+
|   A|   B|
+----+----+
|   1|null|
|   3|   4|
+----+----+
Quick Reference
Summary of dropna() parameters:
| Parameter | Description | Default |
|---|---|---|
| how | 'any' drops rows with any nulls; 'all' drops rows only if all specified columns are null | 'any' |
| thresh | Minimum number of non-null values required to keep the row | None |
| subset | List of columns to check for nulls instead of all columns | None (all columns) |
Key Takeaways
Use DataFrame.dropna() to remove rows with null values in PySpark.
Set the how parameter to 'any' or 'all' to control which rows are dropped.
Use 'subset' to focus on specific columns when dropping nulls.
dropna() returns a new DataFrame; original is unchanged.
Be careful with parameters to avoid dropping unintended rows.