Apache Spark · How-To · Beginner · 4 min read

How to Use fillna in PySpark to Handle Missing Data

In PySpark, you call fillna() on a DataFrame to replace missing values (nulls). You can pass a single value, which replaces nulls in every column whose type is compatible with that value, or a dictionary to replace nulls in specific columns with different values.
📝

Syntax

The fillna() method is called on a PySpark DataFrame. It accepts either a single value or a dictionary mapping column names to replacement values.

  • df.fillna(value): Replace nulls in all columns whose type is compatible with value (e.g., a string value only fills string columns).
  • df.fillna({'col1': val1, 'col2': val2}): Replace nulls in specified columns with different values.
python
DataFrame.fillna(value: Union[Any, Dict[str, Any]], subset: Optional[List[str]] = None) -> DataFrame
💻

Example

This example shows how to replace null values in a PySpark DataFrame using fillna(). It replaces nulls with a single value (which only affects type-compatible columns) and also replaces nulls in specific columns with different values.

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('FillNAExample').getOrCreate()

data = [
    (1, None, 20.0),
    (2, 'blue', None),
    (3, None, None),
    (4, 'green', 40.0)
]

columns = ['id', 'color', 'price']

df = spark.createDataFrame(data, columns)

# Replace nulls with a single string value; only string columns are affected,
# so the null in the double column 'price' stays null
filled_all = df.fillna('missing')

# Replace nulls in specific columns with different values
filled_cols = df.fillna({'color': 'unknown', 'price': 0.0})

print('Original DataFrame:')
df.show()

print('Fill nulls with "missing" (string columns only):')
filled_all.show()

print('Fill nulls in specific columns:')
filled_cols.show()

spark.stop()
Output
Original DataFrame:
+---+-----+-----+
| id|color|price|
+---+-----+-----+
|  1| null| 20.0|
|  2| blue| null|
|  3| null| null|
|  4|green| 40.0|
+---+-----+-----+

Fill nulls with "missing" (string columns only):
+---+-------+-----+
| id|  color|price|
+---+-------+-----+
|  1|missing| 20.0|
|  2|   blue| null|
|  3|missing| null|
|  4|  green| 40.0|
+---+-------+-----+

Fill nulls in specific columns:
+---+-------+-----+
| id|  color|price|
+---+-------+-----+
|  1|unknown| 20.0|
|  2|   blue|  0.0|
|  3|unknown|  0.0|
|  4|  green| 40.0|
+---+-------+-----+
⚠️

Common Pitfalls

A common mistake is passing a replacement value whose type does not match the column: fillna() silently skips columns of a different type, so filling a numeric column with a string does nothing. Likewise, a misspelled column name in the dictionary is silently ignored rather than raising an error, which can make you think nulls were replaced when they were not.

Another pitfall is expecting fillna() to work on every column type: it only supports numeric, string, and boolean columns, so columns with date or complex types are left unchanged.

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('FillNAExample').getOrCreate()

data = [(1, None), (2, 'blue')]
columns = ['id', 'color']

df = spark.createDataFrame(data, columns)

# Wrong: the dict key 'colour' is a typo; PySpark silently ignores
# unknown keys, so no nulls are replaced in 'color'
wrong_fill = df.fillna({'colour': 'unknown'})

# Right: Use correct column name
right_fill = df.fillna({'color': 'unknown'})

print('Wrong fillna usage:')
wrong_fill.show()

print('Correct fillna usage:')
right_fill.show()

spark.stop()
Output
Wrong fillna usage:
+---+-----+
| id|color|
+---+-----+
|  1| null|
|  2| blue|
+---+-----+

Correct fillna usage:
+---+-------+
| id|  color|
+---+-------+
|  1|unknown|
|  2|   blue|
+---+-------+
📊

Quick Reference

Use this quick guide to remember how to use fillna() in PySpark:

Usage                                        Description
df.fillna(value)                             Replace nulls in all type-compatible columns with a single value
df.fillna({'col1': val1, 'col2': val2})      Replace nulls in specified columns with different values
df.fillna(value, subset=['col1', 'col2'])    Replace nulls only in listed columns with a single value
Value types                                  String, number, or boolean, matching the column type
✅

Key Takeaways

  • Use fillna() on a DataFrame to replace null values with a constant or column-specific values.
  • Pass a dictionary to fillna() to replace nulls differently in each column.
  • Always check column names and data types to avoid silent failures.
  • You can limit fillna() to specific columns using the subset parameter.
  • fillna() does not modify the original DataFrame; it returns a new one.