Missing values can cause errors or skewed results in machine learning. Handling them before training helps models learn from complete inputs and make more reliable predictions.
Handling missing values in ML Python
Introduction
Handling missing values is useful in these situations:
When your dataset has empty or unknown entries in some columns.
When you want to prepare data before training a model.
When you want to avoid errors caused by missing data during analysis.
When you want to keep as much data as possible without losing rows.
When you want to fill missing data with reasonable guesses.
Syntax
ML Python
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
SimpleImputer replaces the missing values in each column using a chosen strategy, such as that column's mean, median, or most frequent value.
You must fit the imputer on training data, then transform both training and test data.
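The fit-then-transform rule above can be sketched as follows. This is a minimal illustration, assuming two small hypothetical arrays named `X_train` and `X_test`; the values are made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: rows are samples, columns are features.
X_train = np.array([[1.0, 10.0],
                    [3.0, np.nan],
                    [np.nan, 30.0]])
X_test = np.array([[np.nan, 20.0],
                   [5.0, np.nan]])

imputer = SimpleImputer(strategy="mean")

# Learn the column means from the training data only.
imputer.fit(X_train)

# Reuse those training means to fill the gaps in both sets.
X_train_imputed = imputer.transform(X_train)
X_test_imputed = imputer.transform(X_test)

print(imputer.statistics_)  # per-column means learned from X_train
print(X_test_imputed)
```

The gaps in `X_test` are filled with the means computed from `X_train`, so no information from the test set influences the fill values.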
Examples
Replace missing values with the mean of each column.
ML Python
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
Replace missing values with the median of each column.
ML Python
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)
Replace missing values with the most common value in each column.
ML Python
imputer = SimpleImputer(strategy='most_frequent')
X_imputed = imputer.fit_transform(X)
Replace missing values with zero.
ML Python
imputer = SimpleImputer(strategy='constant', fill_value=0)
X_imputed = imputer.fit_transform(X)
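To see how the four strategies differ, the sketch below applies each one to the same column. The data is hypothetical: a single feature with one outlier (100) and one missing value, chosen so the strategies give visibly different fills.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature column with one outlier and one missing value.
X = np.array([[1.0], [2.0], [2.0], [100.0], [np.nan]])

strategies = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "most_frequent": SimpleImputer(strategy="most_frequent"),
    "constant": SimpleImputer(strategy="constant", fill_value=0),
}

filled = {}
for name, imputer in strategies.items():
    X_imputed = imputer.fit_transform(X)
    filled[name] = X_imputed[-1, 0]  # value that replaced the NaN in the last row
    print(f"{name}: {filled[name]}")
```

Here the mean fill is 26.25 because the outlier pulls it upward, while the median and most frequent value both give 2.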
Sample Program
This program shows how to replace missing values with the mean of each column using SimpleImputer.
ML Python
import numpy as np
from sklearn.impute import SimpleImputer

# Sample data with missing values (np.nan)
X = np.array([[1, 2], [np.nan, 3], [7, np.nan], [np.nan, np.nan]])

# Create imputer to fill missing values with mean
imputer = SimpleImputer(strategy='mean')

# Fit imputer on data and transform
X_imputed = imputer.fit_transform(X)

print("Original data:\n", X)
print("\nData after imputing missing values with mean:\n", X_imputed)
Important Notes
Always fit the imputer only on training data to avoid data leakage.
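The right imputation strategy depends on the data type and distribution. For example, the median is less affected by outliers than the mean, so it often suits skewed numeric columns better.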
For categorical data, use 'most_frequent' or 'constant' strategies.
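The note on categorical data can be illustrated with a short sketch. It assumes a hypothetical one-column array of color labels stored with `dtype=object`, and the placeholder label `"unknown"` is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with two missing entries.
colors = np.array([["red"], ["blue"], [np.nan], ["red"], [np.nan]], dtype=object)

# Fill gaps with the most common category in the column.
mode_imputer = SimpleImputer(strategy="most_frequent")
filled_mode = mode_imputer.fit_transform(colors)

# Alternatively, fill gaps with an explicit placeholder category.
const_imputer = SimpleImputer(strategy="constant", fill_value="unknown")
filled_const = const_imputer.fit_transform(colors)

print(filled_mode.ravel())
print(filled_const.ravel())
```

Using a placeholder keeps "missing" visible as its own category, whereas 'most_frequent' blends the gaps into the dominant class. Which is preferable depends on whether missingness itself carries information.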
Summary
Missing values can cause problems in machine learning models.
SimpleImputer helps fill missing values with mean, median, or other strategies.
Always fit on training data and transform both training and test sets.