
How to Use dropna in PySpark to Remove Null Values

In PySpark, use the dropna() method on a DataFrame to remove rows containing null or missing values. You can customize it by specifying parameters like how (to drop rows with any or all nulls) and subset (to focus on specific columns).

Syntax

The dropna() method removes rows with null values from a DataFrame. It has these main parameters:

  • how: 'any' (default) drops rows with any nulls; 'all' drops rows only if all specified columns are null.
  • thresh: keeps a row only if it has at least this many non-null values; when set, thresh overrides how.
  • subset: list of columns to check for nulls instead of all columns.
python
DataFrame.dropna(how='any', thresh=None, subset=None)

Example

This example shows how to remove rows with any null values and how to drop rows only if all values in specified columns are null.

python
from pyspark.sql import SparkSession
from pyspark.sql import Row

spark = SparkSession.builder.master('local[*]').appName('DropNAExample').getOrCreate()

# Create example data with nulls
data = [
    Row(name='Alice', age=25, city='New York'),
    Row(name='Bob', age=None, city='Los Angeles'),
    Row(name=None, age=None, city=None),
    Row(name='David', age=30, city=None)
]

df = spark.createDataFrame(data)

# Drop rows with any null values
df_drop_any = df.dropna()

# Drop rows where all columns are null
df_drop_all = df.dropna(how='all')

# Drop rows with nulls only in 'age' and 'city'
df_drop_subset = df.dropna(subset=['age', 'city'])

print('Original DataFrame:')
df.show()

print('Drop rows with any nulls:')
df_drop_any.show()

print('Drop rows where all columns are null:')
df_drop_all.show()

print('Drop rows with nulls in age or city:')
df_drop_subset.show()

spark.stop()
Output

Original DataFrame:
+-----+----+-----------+
| name| age|       city|
+-----+----+-----------+
|Alice|  25|   New York|
|  Bob|null|Los Angeles|
| null|null|       null|
|David|  30|       null|
+-----+----+-----------+

Drop rows with any nulls:
+-----+---+--------+
| name|age|    city|
+-----+---+--------+
|Alice| 25|New York|
+-----+---+--------+

Drop rows where all columns are null:
+-----+----+-----------+
| name| age|       city|
+-----+----+-----------+
|Alice|  25|   New York|
|  Bob|null|Los Angeles|
|David|  30|       null|
+-----+----+-----------+

Drop rows with nulls in age or city:
+-----+---+--------+
| name|age|    city|
+-----+---+--------+
|Alice| 25|New York|
+-----+---+--------+

Note that the subset call still uses the default how='any', so Bob (null age) and David (null city) are dropped along with the all-null row.

Common Pitfalls

Common mistakes when using dropna() include:

  • Not specifying subset when you want to check nulls only in certain columns, which may remove more rows than intended.
  • Confusing how='any' and how='all', leading to unexpected row drops.
  • Forgetting that dropna() returns a new DataFrame and does not modify the original.
python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('DropNAExample').getOrCreate()

data = [(1, None), (None, 2), (None, None), (3, 4)]
columns = ['A', 'B']
df = spark.createDataFrame(data, columns)

# Wrong: expecting to drop rows with nulls only in column 'A' but not using subset
wrong_drop = df.dropna()

# Right: specify subset to drop rows with nulls in column 'A' only
right_drop = df.dropna(subset=['A'])

print('Original DataFrame:')
df.show()
print('Wrong drop (no subset):')
wrong_drop.show()
print('Right drop (subset=["A"]):')
right_drop.show()

spark.stop()
Output

Original DataFrame:
+----+----+
|   A|   B|
+----+----+
|   1|null|
|null|   2|
|null|null|
|   3|   4|
+----+----+

Wrong drop (no subset):
+---+---+
|  A|  B|
+---+---+
|  3|  4|
+---+---+

Right drop (subset=["A"]):
+----+----+
|   A|   B|
+----+----+
|   1|null|
|   3|   4|
+----+----+

Quick Reference

Summary of dropna() parameters:

Parameter   Default             Description
how         'any'               'any' drops rows with any nulls; 'all' drops rows only if all specified columns are null
thresh      None                Minimum number of non-null values required to keep the row; overrides how when set
subset      None (all columns)  List of columns to check for nulls instead of all columns

Key Takeaways

  • Use DataFrame.dropna() to remove rows with null values in PySpark.
  • Set how to 'any' or 'all' to control drop behavior.
  • Use subset to focus on specific columns when dropping nulls.
  • dropna() returns a new DataFrame; the original is unchanged.
  • Choose parameters carefully to avoid dropping unintended rows.