How to Handle Missing Values in Python with sklearn
Use sklearn.impute.SimpleImputer, which replaces missing data using a strategy such as the mean or median of each column. This avoids the errors that NaN values cause during model training.
Why This Happens
Missing values occur when some data points are not recorded or are lost. Many scikit-learn estimators, including LogisticRegression, cannot work with NaN values and raise a ValueError when they encounter them. The following code fails at model.fit for that reason:
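    from sklearn.linear_model import LogisticRegression
    import numpy as np

    # The second row has a missing value in the first column.
    X = np.array([[1, 2], [np.nan, 3], [7, 6]])
    y = np.array([0, 1, 0])

    model = LogisticRegression()
    # Raises ValueError because X contains NaN.
    model.fit(X, y)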
The Fix
Use SimpleImputer from sklearn.impute to replace each missing value with a statistic computed from its column, such as the mean. With the NaN values filled in, the model can train without errors.
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    import numpy as np

    X = np.array([[1, 2], [np.nan, 3], [7, 6]])
    y = np.array([0, 1, 0])

    # Replace each NaN with the mean of its column.
    imputer = SimpleImputer(strategy='mean')
    X_imputed = imputer.fit_transform(X)

    # Training now succeeds because X_imputed has no NaN values.
    model = LogisticRegression()
    model.fit(X_imputed, y)

    predictions = model.predict(X_imputed)
    print(predictions)
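To see what the imputer actually filled in, it can help to inspect its fitted statistics_ attribute. The snippet below is a minimal sketch that reuses the same toy array as the fix; it is only meant to illustrate how the mean strategy behaves, not to prescribe a workflow.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same toy array as in the fix: one NaN in row 1, column 0.
X = np.array([[1, 2], [np.nan, 3], [7, 6]])

imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

# Per-column means, computed from the non-missing entries only.
print(imputer.statistics_)

# The NaN is replaced by the mean of the remaining values in
# column 0, which are 1 and 7.
print(X_imputed[1, 0])  # 4.0
```

Switching to strategy="median" or strategy="most_frequent" changes only which statistic is stored and used for filling.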
Prevention
Before training a model, always check your data for missing values with pandas.DataFrame.isnull() or numpy.isnan(). Impute missing values or drop the affected rows early to avoid errors, and automate this step in your data pipeline.
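One way to put this advice into practice is sketched below: count the missing values first, then bundle the imputer and the model in a scikit-learn pipeline so the imputation step cannot be skipped. The DataFrame, its column names, and the labels are made-up toy data, and the median strategy is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; the column names "a" and "b" are illustrative.
df = pd.DataFrame({
    "a": [1.0, np.nan, 7.0, 4.0],
    "b": [2.0, 3.0, 6.0, np.nan],
})
y = np.array([0, 1, 0, 1])

# Count missing values per column before training.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# The NumPy equivalent for a plain numeric array.
has_nan = np.isnan(df.to_numpy()).any()
print(has_nan)

# Imputation and model in one object: fit() imputes, then trains.
pipeline = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression())
pipeline.fit(df, y)
predictions = pipeline.predict(df)
print(predictions)
```

Because the imputer lives inside the pipeline, the same fill values learned during fit are reused whenever predict is called on new data.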
Related Errors
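A ValueError such as "could not convert string to float" can occur when non-numeric data is mixed with missing values, for example when a mean or median strategy is applied to a column of strings. Fix this by handling categorical columns separately: encode them before imputing, or impute them with a strategy that supports strings, such as most_frequent. A TypeError can also occur if an imputer receives data of an unsupported type.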
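As a rough sketch of treating numeric and categorical columns separately, the example below uses ColumnTransformer to give each column its own imputer. Note that it imputes the categorical column with most_frequent and encodes afterwards, which is a common alternative to encoding first rather than the exact order described above. The DataFrame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical mixed-type data; column names are made up for illustration.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Paris", np.nan, "Rome", "Paris"],
})

preprocess = ColumnTransformer([
    # Numeric column: fill missing values with the column mean.
    ("num", SimpleImputer(strategy="mean"), ["age"]),
    # Categorical column: fill with the most frequent value, then one-hot encode.
    ("cat",
     make_pipeline(SimpleImputer(strategy="most_frequent"),
                   OneHotEncoder(handle_unknown="ignore")),
     ["city"]),
])

# Result has one numeric column plus one column per city category.
X_clean = preprocess.fit_transform(df)
print(X_clean)
```

Applying a mean strategy to the city column instead would fail, since a mean cannot be computed from strings.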