
How to Handle Missing Values in Python with sklearn

In Python, you can handle missing values with sklearn.impute.SimpleImputer, which replaces each missing entry with a statistic such as the column mean or median. This avoids the errors that NaN values cause during model training.

🔍

Why This Happens

Missing values occur when some data points are never recorded or are lost along the way. Many scikit-learn estimators, LogisticRegression among them, do not accept NaN values and raise a ValueError when they encounter one.

python

from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.array([[1, 2], [np.nan, 3], [7, 6]])
y = np.array([0, 1, 0])

model = LogisticRegression()
model.fit(X, y)

Output

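ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The exact wording depends on the scikit-learn version; recent releases report "Input X contains NaN." instead.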

🔧

The Fix

Use SimpleImputer from sklearn.impute to replace each missing value with a statistic computed from its column, such as the mean. Once the NaN values are filled in, the model can train without errors.

python

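from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
import numpy as np

# Same data as before: the first column has one missing value.
X = np.array([[1, 2], [np.nan, 3], [7, 6]])
y = np.array([0, 1, 0])

# Replace each NaN with the mean of its column (here (1 + 7) / 2 = 4).
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# Training now succeeds because X_imputed contains no NaN.
model = LogisticRegression()
model.fit(X_imputed, y)

predictions = model.predict(X_imputed)
print(predictions)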

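Output

[0 0 0]

With only three rows and the default regularization, the model predicts the majority class for every row. What matters here is that training completes without a ValueError.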
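The choice of strategy matters most when a column is skewed. The following is a minimal sketch, using a hypothetical single-column array (X_skewed, not part of the example above) with one outlier, that compares the fill value each strategy produces. It reads the value from the imputer's statistics_ attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column data with one outlier (100) and one missing entry.
X_skewed = np.array([[1.0], [2.0], [2.0], [100.0], [np.nan]])

fill_values = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    imputer.fit(X_skewed)
    # statistics_ holds the per-column value used to fill missing entries.
    fill_values[strategy] = float(imputer.statistics_[0])

print(fill_values)
# {'mean': 26.25, 'median': 2.0, 'most_frequent': 2.0}
```

The outlier pulls the mean far from the typical value, while the median and most frequent value stay at 2.0, which is one reason the median is often preferred for skewed columns.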

🛡️

Prevention

Always check your data for missing values before training, using pandas.DataFrame.isnull() or numpy.isnan(). Impute the missing values or drop the affected rows early, and automate this step in your data pipeline.
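The checks mentioned above could look like the sketch below. It reuses the toy array from this article; the column names feature_a and feature_b are made up for illustration.

```python
import numpy as np
import pandas as pd

X = np.array([[1, 2], [np.nan, 3], [7, 6]])
df = pd.DataFrame(X, columns=["feature_a", "feature_b"])  # hypothetical column names

# Count missing entries per column with pandas.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# The same yes/no check on a raw NumPy array.
has_missing = bool(np.isnan(X).any())
print(has_missing)  # True

# One option when only a few rows are affected: drop them.
df_complete = df.dropna()
print(len(df_complete))  # 2
```

Dropping rows is simple but discards data, so it is usually reasonable only when few rows are affected.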

⚠️

Related Errors

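Errors like ValueError: could not convert string to float appear when a column holds strings and a numeric strategy such as mean or median is applied to it. Handle numeric and categorical columns separately: use mean or median for numeric columns, use most_frequent or constant for categorical ones, and encode the categorical columns before training. A TypeError can also occur if the imputer receives data of an unsupported type.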
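One way to keep numeric and categorical columns apart is scikit-learn's ColumnTransformer. The sketch below is illustrative only: the DataFrame, its age and city columns, and the chosen strategies are assumptions, not part of the original example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical mixed-type data: one numeric and one categorical column, each with a gap.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": pd.Series(["Paris", "Lima", np.nan, "Lima"], dtype=object),
})

# Categorical columns: fill with the most frequent label, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    [
        ("num", SimpleImputer(strategy="mean"), ["age"]),  # numeric: column mean
        ("cat", categorical, ["city"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_clean = preprocess.fit_transform(df)
print(X_clean.shape)  # (4, 3): age plus two one-hot city columns
```

The missing age is filled with the mean of the observed ages (32.0), and the missing city is filled with "Lima" before encoding.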

✅

Key Takeaways

Use sklearn's SimpleImputer to replace missing values before model training.

Check your data for missing values early to prevent errors.

Choose an imputation strategy like mean, median, or most frequent based on your data.

Automate missing value handling in your data preprocessing pipeline.

Be aware of related errors caused by data type issues or mixed data.
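The automation takeaway can be sketched with a scikit-learn Pipeline that chains the imputer and the model. The data below is synthetic and the median strategy is an arbitrary choice for illustration. Because the imputer is inside the pipeline, its fill values are learned from the training split only and then reused on the test split.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data for illustration only: 200 rows, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Set roughly 10% of the entries to NaN to simulate missing values.
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# fit() fits the imputer on X_train, then trains the model on the filled data.
model = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression())
model.fit(X_train, y_train)

# predict() applies the training-set medians to X_test before predicting.
test_predictions = model.predict(X_test)
print(test_predictions.shape)  # (50,)
```

Keeping imputation inside the pipeline also means the same preprocessing is applied automatically during cross-validation and at prediction time.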