Apache Spark · How-To · Beginner · 4 min read

How to Use fillna in PySpark to Handle Missing Data

In PySpark, you call fillna() on a DataFrame to replace missing values (nulls). You can pass a single value, which replaces nulls in every column whose type is compatible with that value, or a dictionary to replace nulls in specific columns with different values.
📝

Syntax

The fillna() method is called on a PySpark DataFrame. It accepts either a single value or a dictionary mapping column names to replacement values.

  • df.fillna(value): Replace nulls in all columns whose type is compatible with value (e.g., a string value only fills string columns).
  • df.fillna({'col1': val1, 'col2': val2}): Replace nulls in specified columns with different values.
python
DataFrame.fillna(value: Union[Any, Dict[str, Any]], subset: Optional[List[str]] = None) -> DataFrame
💻

Example

This example shows how to replace null values in a PySpark DataFrame using fillna(). It replaces nulls with a single value (which only affects type-compatible columns) and also replaces nulls in specific columns with different values.

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('FillNAExample').getOrCreate()

data = [
    (1, None, 20.0),
    (2, 'blue', None),
    (3, None, None),
    (4, 'green', 40.0)
]

columns = ['id', 'color', 'price']

df = spark.createDataFrame(data, columns)

# Replace nulls with a single string value; only string columns are affected,
# so the null in the double column 'price' stays null
filled_all = df.fillna('missing')

# Replace nulls in specific columns with different values
filled_cols = df.fillna({'color': 'unknown', 'price': 0.0})

print('Original DataFrame:')
df.show()

print('Fill nulls with "missing" (string columns only):')
filled_all.show()

print('Fill nulls in specific columns:')
filled_cols.show()

spark.stop()
Output
Original DataFrame:
+---+-----+-----+
| id|color|price|
+---+-----+-----+
|  1| null| 20.0|
|  2| blue| null|
|  3| null| null|
|  4|green| 40.0|
+---+-----+-----+

Fill nulls with "missing" (string columns only):
+---+-------+-----+
| id|  color|price|
+---+-------+-----+
|  1|missing| 20.0|
|  2|   blue| null|
|  3|missing| null|
|  4|  green| 40.0|
+---+-------+-----+

Fill nulls in specific columns:
+---+-------+-----+
| id|  color|price|
+---+-------+-----+
|  1|unknown| 20.0|
|  2|   blue|  0.0|
|  3|unknown|  0.0|
|  4|  green| 40.0|
+---+-------+-----+
⚠️

Common Pitfalls

A common mistake is passing a replacement value whose type does not match the column: fillna() silently skips columns of a different type, so filling a numeric column with a string does nothing. Likewise, a misspelled column name in the dictionary is silently ignored rather than raising an error, which can make you think nulls were replaced when they were not.

Another pitfall is expecting fillna() to work on every column type: it only supports numeric, string, and boolean columns, so columns with date or complex types are left unchanged.

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('FillNAExample').getOrCreate()

data = [(1, None), (2, 'blue')]
columns = ['id', 'color']

df = spark.createDataFrame(data, columns)

# Wrong: the dict key 'colour' is a typo; PySpark silently ignores
# unknown keys, so no nulls are replaced in 'color'
wrong_fill = df.fillna({'colour': 'unknown'})

# Right: Use correct column name
right_fill = df.fillna({'color': 'unknown'})

print('Wrong fillna usage:')
wrong_fill.show()

print('Correct fillna usage:')
right_fill.show()

spark.stop()
Output
Wrong fillna usage:
+---+-----+
| id|color|
+---+-----+
|  1| null|
|  2| blue|
+---+-----+

Correct fillna usage:
+---+-------+
| id|  color|
+---+-------+
|  1|unknown|
|  2|   blue|
+---+-------+
📊

Quick Reference

Use this quick guide to remember how to use fillna() in PySpark:

Usage                                        Description
df.fillna(value)                             Replace nulls in all type-compatible columns with a single value
df.fillna({'col1': val1, 'col2': val2})      Replace nulls in specified columns with different values
df.fillna(value, subset=['col1', 'col2'])    Replace nulls only in listed columns with a single value
Value types                                  String, number, or boolean, matching the column type
✅

Key Takeaways

  • Use fillna() on a DataFrame to replace null values with a constant or column-specific values.
  • Pass a dictionary to fillna() to replace nulls differently in each column.
  • Always check column names and data types to avoid silent failures.
  • You can limit fillna() to specific columns using the subset parameter.
  • fillna() does not modify the original DataFrame; it returns a new one.