
How to Use dropna in PySpark to Remove Null Values

In PySpark, use the dropna() method on a DataFrame to remove rows containing null or missing values. You can customize it by specifying parameters like how (to drop rows with any or all nulls) and subset (to focus on specific columns).

Syntax

The dropna() method removes rows with null values from a DataFrame. It has these main parameters:

  • how: 'any' (default) drops rows with any nulls; 'all' drops rows only if all specified columns are null.
  • thresh: keeps a row only if it has at least this many non-null values; when set, thresh overrides how.
  • subset: list of columns to check for nulls instead of all columns.
python
DataFrame.dropna(how='any', thresh=None, subset=None)

Example

This example shows how to remove rows with any null values and how to drop rows only if all values in specified columns are null.

python
from pyspark.sql import SparkSession
from pyspark.sql import Row

spark = SparkSession.builder.master('local[*]').appName('DropNAExample').getOrCreate()

# Create example data with nulls
data = [
    Row(name='Alice', age=25, city='New York'),
    Row(name='Bob', age=None, city='Los Angeles'),
    Row(name=None, age=None, city=None),
    Row(name='David', age=30, city=None)
]

df = spark.createDataFrame(data)

# Drop rows with any null values
df_drop_any = df.dropna()

# Drop rows where all columns are null
df_drop_all = df.dropna(how='all')

# Drop rows with nulls only in 'age' and 'city'
df_drop_subset = df.dropna(subset=['age', 'city'])

print('Original DataFrame:')
df.show()

print('Drop rows with any nulls:')
df_drop_any.show()

print('Drop rows where all columns are null:')
df_drop_all.show()

print('Drop rows with nulls in age or city:')
df_drop_subset.show()

spark.stop()
Output

Original DataFrame:
+-----+----+-----------+
| name| age|       city|
+-----+----+-----------+
|Alice|  25|   New York|
|  Bob|null|Los Angeles|
| null|null|       null|
|David|  30|       null|
+-----+----+-----------+

Drop rows with any nulls:
+-----+---+--------+
| name|age|    city|
+-----+---+--------+
|Alice| 25|New York|
+-----+---+--------+

Drop rows where all columns are null:
+-----+----+-----------+
| name| age|       city|
+-----+----+-----------+
|Alice|  25|   New York|
|  Bob|null|Los Angeles|
|David|  30|       null|
+-----+----+-----------+

Drop rows with nulls in age or city:
+-----+---+--------+
| name|age|    city|
+-----+---+--------+
|Alice| 25|New York|
+-----+---+--------+

Note that the subset call still uses the default how='any', so Bob (null age) and David (null city) are dropped along with the all-null row.

Common Pitfalls

Common mistakes when using dropna() include:

  • Not specifying subset when you want to check nulls only in certain columns, which may remove more rows than intended.
  • Confusing how='any' and how='all', leading to unexpected row drops.
  • Forgetting that dropna() returns a new DataFrame and does not modify the original.
python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('DropNAExample').getOrCreate()

data = [(1, None), (None, 2), (None, None), (3, 4)]
columns = ['A', 'B']
df = spark.createDataFrame(data, columns)

# Wrong: expecting to drop rows with nulls only in column 'A' but not using subset
wrong_drop = df.dropna()

# Right: specify subset to drop rows with nulls in column 'A' only
right_drop = df.dropna(subset=['A'])

print('Original DataFrame:')
df.show()
print('Wrong drop (no subset):')
wrong_drop.show()
print('Right drop (subset=["A"]):')
right_drop.show()

spark.stop()
Output

Original DataFrame:
+----+----+
|   A|   B|
+----+----+
|   1|null|
|null|   2|
|null|null|
|   3|   4|
+----+----+

Wrong drop (no subset):
+---+---+
|  A|  B|
+---+---+
|  3|  4|
+---+---+

Right drop (subset=["A"]):
+----+----+
|   A|   B|
+----+----+
|   1|null|
|   3|   4|
+----+----+

Quick Reference

Summary of dropna() parameters:

Parameter   Default             Description
how         'any'               'any' drops rows with any nulls; 'all' drops rows only if all specified columns are null
thresh      None                Minimum number of non-null values required to keep the row; overrides how when set
subset      None (all columns)  List of columns to check for nulls instead of all columns

Key Takeaways

  • Use DataFrame.dropna() to remove rows with null values in PySpark.
  • Set how to 'any' or 'all' to control drop behavior.
  • Use subset to focus on specific columns when dropping nulls.
  • dropna() returns a new DataFrame; the original is unchanged.
  • Choose parameters carefully to avoid dropping unintended rows.