
Handling missing values in ML Python

Introduction

Missing values can cause errors or misleading results in machine learning. Handling them before training helps models learn from all the available data and make more reliable predictions. Typical situations where this matters:

When your dataset has empty or unknown entries in some columns.
When you want to prepare data before training a model.
When you want to avoid errors caused by missing data during analysis.
When you want to keep as much data as possible without losing rows.
When you want to fill missing data with reasonable guesses.
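Before choosing how to handle missing values, it helps to know how many there are. The sketch below is a minimal illustration, assuming a small made-up NumPy array in which np.nan marks a missing entry. It counts missing values per column and shows how few rows would survive if every incomplete row were dropped.

```python
import numpy as np

# Hypothetical toy data: 4 rows, 2 numeric columns, np.nan marks a missing entry
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [np.nan, np.nan]])

missing_mask = np.isnan(X)                              # True where a value is missing
missing_per_column = missing_mask.sum(axis=0)           # count of missing entries per column
complete_rows = (~missing_mask.any(axis=1)).sum()       # rows with no missing entry at all

print("Missing per column:", missing_per_column)
print("Rows left if incomplete rows were dropped:", complete_rows, "of", X.shape[0])
```

Here only one of the four rows is complete, which is why filling values in is often preferable to dropping rows.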
Syntax
ML Python
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

SimpleImputer replaces the missing values in each column using a chosen strategy: 'mean', 'median', 'most_frequent', or 'constant'.

You must fit the imputer on training data, then transform both training and test data.
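The fit-then-transform order can be sketched as follows. The arrays X_train and X_test are made-up stand-ins for an already-split dataset, not part of the original example. The imputer learns the column means from the training rows only and reuses them for the test rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical, already-split data (names and values are illustrative)
X_train = np.array([[1.0, 10.0], [3.0, np.nan], [np.nan, 30.0]])
X_test = np.array([[np.nan, 20.0], [5.0, np.nan]])

imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)  # learns column means from training rows only
X_test_imputed = imputer.transform(X_test)        # reuses the training means, no refitting

print("Means learned from training data:", imputer.statistics_)
print("Imputed test data:\n", X_test_imputed)
```

The learned fill values are stored in the statistics_ attribute, so you can check exactly what was used for the test set.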

Examples
Replace missing values with the mean of each column.
ML Python
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
Replace missing values with the median of each column.
ML Python
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)
Replace missing values with the most common value in each column.
ML Python
imputer = SimpleImputer(strategy='most_frequent')
X_imputed = imputer.fit_transform(X)
Replace missing values with zero.
ML Python
imputer = SimpleImputer(strategy='constant', fill_value=0)
X_imputed = imputer.fit_transform(X)
Sample Program

This program shows how to replace missing values with the mean of each column using SimpleImputer.

ML Python
import numpy as np
from sklearn.impute import SimpleImputer

# Sample data with missing values (np.nan)
X = np.array([[1, 2], [np.nan, 3], [7, np.nan], [np.nan, np.nan]])

# Create imputer to fill missing values with mean
imputer = SimpleImputer(strategy='mean')

# Fit imputer on data and transform
X_imputed = imputer.fit_transform(X)

print("Original data:\n", X)
print("\nData after imputing missing values with mean:\n", X_imputed)
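Output

The column means are 4.0 (from 1 and 7) and 2.5 (from 2 and 3), so every missing entry in the first column is replaced by 4.0 and every missing entry in the second column by 2.5.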
Important Notes

Always fit the imputer only on training data to avoid data leakage.
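One way to make this rule harder to break is to wrap the imputer and the model in a scikit-learn pipeline, so that fitting the pipeline fits the imputer on the training rows only. This is a sketch rather than the tutorial's own method: the synthetic data, the choice of LogisticRegression, and the variable names are all assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data (illustrative only): 100 rows, 3 features, label depends on feature 0
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the entries

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# fit() fits the imputer on X_train only; predict() applies those same means to X_test
model = make_pipeline(SimpleImputer(strategy='mean'), LogisticRegression())
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print("Imputation means:", model.named_steps['simpleimputer'].statistics_)
```

Because the test rows never reach fit, their values cannot influence the fill values.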

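The right imputation strategy depends on the data type and distribution: the mean suits roughly symmetric numeric columns, while the median is less affected by outliers and skew.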
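To see how the distribution affects the choice, the sketch below compares the mean and median fill values on a made-up column that contains one extreme value. The data is hypothetical and chosen only to make the difference visible.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical skewed column: one extreme value (1000) and one missing entry
X = np.array([[1.0], [2.0], [3.0], [1000.0], [np.nan]])

# statistics_ holds the value each imputer would use to fill the column
mean_fill = SimpleImputer(strategy='mean').fit(X).statistics_[0]
median_fill = SimpleImputer(strategy='median').fit(X).statistics_[0]

print("mean fill:", mean_fill)      # 251.5, pulled up by the outlier
print("median fill:", median_fill)  # 2.5, close to the typical values
```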

For categorical data, use the 'most_frequent' or 'constant' strategy, because 'mean' and 'median' only work with numeric data.
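A minimal sketch of both categorical strategies, assuming a made-up single-column array of color labels stored with dtype=object and np.nan as the missing marker. The 'unknown' placeholder is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical categorical column; np.nan marks the missing label
X_cat = np.array([["red"], ["blue"], [np.nan], ["red"]], dtype=object)

# Fill with the most common label in the column
mode_imputer = SimpleImputer(strategy='most_frequent')
filled_mode = mode_imputer.fit_transform(X_cat)

# Fill with an explicit placeholder label instead
const_imputer = SimpleImputer(strategy='constant', fill_value='unknown')
filled_const = const_imputer.fit_transform(X_cat)

print(filled_mode.ravel())
print(filled_const.ravel())
```

A placeholder such as 'unknown' keeps the fact that a value was missing visible to the model, whereas 'most_frequent' blends the missing rows into the largest category.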

Summary

Missing values can cause problems in machine learning models.

SimpleImputer fills missing values with the column mean, median, most frequent value, or a constant.

Always fit on training data and transform both training and test sets.