
How to Handle Categorical Variables in ML with Python sklearn

In Python's sklearn, categorical variables must be converted to numbers before training models. Use OneHotEncoder for nominal (unordered) categories or OrdinalEncoder for ordinal (ordered) categories to transform text labels into numeric arrays.
🔍

Why This Happens

Most sklearn estimators require numeric input. If you pass text labels directly, fit raises a ValueError because the strings cannot be converted to floats.

python
from sklearn.linear_model import LogisticRegression

X = [["red"], ["blue"], ["green"]]
y = [0, 1, 0]

model = LogisticRegression()
model.fit(X, y)  # raises ValueError: the string features are not numeric
Output
ValueError: could not convert string to float: 'red'
🔧

The Fix

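Convert categorical text data to numbers using OneHotEncoder for categories without order or OrdinalEncoder for ordered categories. One-hot encoding creates one binary column per category, so the model does not infer a false ordering. Ordinal encoding maps each category to a single number that preserves its rank.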

python
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X = [["red"], ["blue"], ["green"], ["blue"]]
y = [0, 1, 0, 1]

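# sparse_output requires scikit-learn 1.2+; older versions use sparse=False
encoder = OneHotEncoder(sparse_output=False)
model = LogisticRegression()
# The pipeline fits the encoder and the model together
pipeline = make_pipeline(encoder, model)

pipeline.fit(X, y)
# New data goes through the same fitted encoder before prediction
predictions = pipeline.predict([["green"], ["red"]])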
print(predictions)
Output
[0 0]
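
The fix above covers unordered categories only. For ordered categories, a minimal sketch with OrdinalEncoder might look like the following. The shirt-size feature and its ordering are assumptions made up for illustration, not part of the original example.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered feature: shirt sizes from smallest to largest
X = [["small"], ["large"], ["medium"], ["small"]]

# Pass the order explicitly; by default categories are sorted alphabetically
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
X_encoded = encoder.fit_transform(X)

print(X_encoded.ravel().tolist())  # [0.0, 2.0, 1.0, 0.0]
```

Without the categories argument, the alphabetical default would rank "large" below "medium", which is usually not the intended order.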
🛡️

Prevention

Always check your data types before training. Use OneHotEncoder or OrdinalEncoder to transform categorical features. Put the encoder inside a pipeline so it is fit only on the training data, which avoids data leakage and keeps preprocessing identical at prediction time.
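
One way to combine the dtype check with pipeline-based encoding is sketched below. It assumes the data lives in a pandas DataFrame; the column names and values are hypothetical, and ColumnTransformer is one reasonable choice here rather than the only option.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one text column and one numeric column
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],
    "weight": [1.0, 2.5, 1.2, 2.7],
})
y = [0, 1, 0, 1]

# Check data types: any non-numeric column needs encoding
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)  # ['color']

# Encode only the categorical columns and pass numeric ones through unchanged
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    remainder="passthrough",
)
pipeline = make_pipeline(preprocess, LogisticRegression())
pipeline.fit(df, y)
```

Because the encoder is a step inside the pipeline, calling pipeline.predict on new rows reuses the categories learned during fit.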

⚠️

Related Errors

Common errors include:

  • ValueError: could not convert string to float - caused by unencoded categorical data.
  • DataConversionWarning - raised when sklearn implicitly converts your input to a different dtype or shape than the one you passed.
  • Shape mismatch errors - when the encoder is fit separately on training and test sets and the categories differ, so the encoded arrays have different numbers of columns.

Fix these by fitting the encoder once on the training data and applying the same fitted pipeline to all later data.

Key Takeaways

Machine learning models need numeric input; convert categorical variables before training.
Use OneHotEncoder for categories without order and OrdinalEncoder for ordered categories.
Integrate encoding in sklearn pipelines to keep preprocessing consistent and avoid errors.
Always check data types and encoding before fitting models to prevent conversion errors.