How to Use fillna in PySpark to Handle Missing Data
Call fillna() on a DataFrame to replace missing values (nulls) with a specified value or values. You can pass a single value to replace all nulls, or a dictionary to replace nulls in specific columns with different values.
Syntax
The fillna() method is called on a PySpark DataFrame. It accepts either a single value or a dictionary mapping column names to replacement values.
- df.fillna(value): Replace all nulls in the DataFrame with value.
- df.fillna({'col1': val1, 'col2': val2}): Replace nulls in specified columns with different values.
```python
DataFrame.fillna(value: Union[Any, Dict[str, Any]], subset: Optional[List[str]] = None) -> DataFrame
```
Example
This example shows how to replace null values in a PySpark DataFrame using fillna(). It replaces all nulls with a single value and also replaces nulls in specific columns with different values.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('FillNAExample').getOrCreate()

data = [
    (1, None, 20.0),
    (2, 'blue', None),
    (3, None, None),
    (4, 'green', 40.0),
]
columns = ['id', 'color', 'price']
df = spark.createDataFrame(data, columns)

# Replace nulls with a single value (a string fill only affects string columns)
filled_all = df.fillna('missing')

# Replace nulls in specific columns with different values
filled_cols = df.fillna({'color': 'unknown', 'price': 0.0})

print('Original DataFrame:')
df.show()
print('Fill string nulls with "missing":')
filled_all.show()
print('Fill nulls in specific columns:')
filled_cols.show()

spark.stop()
```
Common Pitfalls
A common mistake is passing a replacement value whose type does not match the column's type. fillna() silently skips columns it cannot fill, so a string value leaves numeric nulls untouched, and vice versa. Forgetting to restrict the fill to specific columns, via a dictionary or the subset parameter, can also replace nulls you meant to keep.
Another pitfall is a typo in a dictionary key: fillna() silently ignores column names that do not exist in the DataFrame, so no nulls are replaced and no error is raised. Also note that fillna() does not handle complex column types such as arrays, maps, or structs.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('FillNAExample').getOrCreate()

data = [(1, None), (2, 'blue')]
columns = ['id', 'color']
df = spark.createDataFrame(data, columns)

# Wrong: dictionary key has a typo ('colour' instead of 'color').
# fillna() silently ignores unknown column names, so no nulls are replaced.
wrong_fill = df.fillna({'colour': 'unknown'})

# Right: use the correct column name
right_fill = df.fillna({'color': 'unknown'})

print('Wrong fillna usage:')
wrong_fill.show()
print('Correct fillna usage:')
right_fill.show()

spark.stop()
```
Quick Reference
Use this quick guide to remember how to use fillna() in PySpark:
| Usage | Description |
|---|---|
| df.fillna(value) | Replace all nulls in DataFrame with a single value |
| df.fillna({'col1': val1, 'col2': val2}) | Replace nulls in specified columns with different values |
| df.fillna(value, subset=['col1', 'col2']) | Replace nulls only in listed columns with a single value |
| Value types | Replacement values must match the column type (string, numeric, or boolean); mismatched columns are silently skipped |