
Choosing a missing data strategy in Pandas

Introduction

Sometimes data is incomplete or missing. We need to decide how to handle these gaps to keep our analysis accurate and useful.

This decision typically comes up in situations like these:

When you find empty cells in a survey dataset.
When sensor data has gaps due to technical issues.
When customer information is partially missing in a sales database.
When cleaning data before building a machine learning model.
When summarizing data and missing values could affect results.
Syntax
Pandas
df.dropna()
df.fillna(value)
df.isna()
df.notna()

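dropna() removes rows or columns that contain missing data.

fillna() replaces missing data with a value you choose.

isna() returns a Boolean mask that is True wherever a value is missing; notna() returns the opposite mask.

None of these methods changes df in place by default; each returns a new object, so assign the result if you want to keep it.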
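Before choosing between dropping and filling, it often helps to measure how much data is actually missing. The following is a minimal sketch using isna() and notna(); the sample DataFrame and the variable names are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, None, 30, None],
    'Score': [85, 90, None, 88],
})

# isna() returns a Boolean DataFrame: True where a value is missing
missing_mask = df.isna()

# Summing the mask counts the missing values in each column
missing_counts = missing_mask.sum()

# The mean of the mask is the fraction of missing values in each column
missing_fraction = missing_mask.mean()

# notna() is the inverse of isna(); here it keeps rows where Age is present
rows_with_age = df[df['Age'].notna()]

print(missing_counts)
print(missing_fraction)
print(rows_with_age)
```

A column with only a few gaps is usually a reasonable candidate for filling, while a column that is mostly empty may be better dropped, though the right cut-off depends on the analysis.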

Examples
Removes all rows that have any missing values.
Pandas
df.dropna()
Replaces all missing values with zero.
Pandas
df.fillna(0)
Fills missing ages with the average age.
Pandas
df['Age'].fillna(df['Age'].mean())
Removes columns that have any missing values.
Pandas
df.dropna(axis=1)
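dropna() does not have to be all-or-nothing. As a hedged sketch, the subset, thresh, and how parameters let you drop rows more selectively; the sample data and variable names below are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, None, 30, None],
    'Score': [85, 90, None, None],
})

# Drop rows only when Age is missing; gaps in other columns are ignored
drop_missing_age = df.dropna(subset=['Age'])

# Keep rows that have at least 2 non-missing values across all columns
keep_mostly_complete = df.dropna(thresh=2)

# Drop rows only when both Age and Score are missing
drop_if_all_missing = df.dropna(how='all', subset=['Age', 'Score'])

print(drop_missing_age)
print(keep_mostly_complete)
print(drop_if_all_missing)
```

These options can preserve more rows than a plain df.dropna(), which removes a row as soon as any single value is missing.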
Sample Program

This code shows three ways to handle missing data: removing rows with missing values, filling missing values with zero, and filling missing ages with the average age.

Pandas
import pandas as pd

# Create a sample data frame with missing values
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, None, 30, None],
        'Score': [85, 90, None, 88]}

df = pd.DataFrame(data)

print('Original DataFrame:')
print(df)

# Strategy 1: Drop rows with any missing data
cleaned_drop = df.dropna()
print('\nDataFrame after dropna():')
print(cleaned_drop)

# Strategy 2: Fill missing values with a constant
filled_zero = df.fillna(0)
print('\nDataFrame after fillna(0):')
print(filled_zero)

# Strategy 3: Fill missing Age with mean age
mean_age = df['Age'].mean()
filled_mean = df.copy()
filled_mean['Age'] = filled_mean['Age'].fillna(mean_age)
print(f'\nMean Age: {mean_age}')
print('DataFrame after filling Age with mean:')
print(filled_mean)
Important Notes

Dropping rows can reduce data size and may lose important information.

Filling missing values preserves the dataset's size but can introduce bias if the fill value is chosen carelessly.
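To make the bias concern concrete, here is a small sketch comparing summary statistics before and after filling. The ages Series is invented for illustration, and the comparison is only one simple way to look at the effect.

```python
import pandas as pd

# Hypothetical ages with two missing entries
ages = pd.Series([20, 30, None, 40, None, 50], name='Age')

# mean() and std() skip missing values by default
mean_before = ages.mean()
std_before = ages.std()

# Filling with zero pulls the mean down sharply
zero_filled = ages.fillna(0)

# Filling with the mean keeps the mean but shrinks the spread
mean_filled = ages.fillna(mean_before)

print('Mean before:', mean_before)
print('Mean after fillna(0):', zero_filled.mean())
print('Mean after fillna(mean):', mean_filled.mean())
print('Std before:', std_before)
print('Std after fillna(mean):', mean_filled.std())
```

Filling with the mean leaves the average unchanged but makes the data look less variable than it really is, which can matter for later statistics or models.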

Always consider the context and why data is missing before choosing a strategy.
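Context can change which fill makes sense. For ordered data such as sensor readings, a single constant is often a poor fit; as a hedged sketch, ffill() carries the last known value forward and interpolate() estimates values between known points. These two methods go beyond the ones covered above, and the readings below are assumed values for illustration.

```python
import pandas as pd

# Hypothetical temperature readings in time order, with a two-step gap
readings = pd.Series([20.0, None, None, 23.0, 24.0], name='temperature')

# Carry the last known reading forward into the gap
carried_forward = readings.ffill()

# Estimate the gap linearly between the surrounding readings
interpolated = readings.interpolate()

print(carried_forward)
print(interpolated)
```

Neither result is the true reading; which approximation is acceptable depends on how the sensor behaves and why the gap occurred.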

Summary

Missing data can be handled by dropping or filling values.

Use dropna() to remove rows or columns that contain missing data.

Use fillna() to replace missing data with meaningful values.