How to Use OneHotEncoder in sklearn with Python

Use OneHotEncoder from sklearn.preprocessing to convert categorical features into one-hot numeric arrays. Call fit() to learn the categories from your training data, then transform() to encode it, or do both in one step with fit_transform(). Reuse the fitted encoder's transform() on any new data so the columns stay consistent.

📐

Syntax

The basic syntax to use OneHotEncoder is:

OneHotEncoder(): Creates the encoder object.
fit(X): Learns the categories from data X.
transform(X): Converts data X into one-hot encoded format.
fit_transform(X): Combines fit and transform in one step.

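You can customize behavior with parameters such as handle_unknown, which controls what happens when transform() meets a category that was not seen during fit, and sparse_output, which selects between a SciPy sparse matrix and a dense NumPy array. The sparse_output parameter was named sparse before scikit-learn 1.2.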

python

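from sklearn.preprocessing import OneHotEncoder

# X is your categorical data; it must be 2D (rows = samples, columns = features)
X = [['red'], ['green'], ['blue']]

encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoder.fit(X)  # learn the categories of each feature
X_encoded = encoder.transform(X)  # one column per learned category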

💻

Example

This example shows how to encode a small dataset with two categorical features using OneHotEncoder. It fits the encoder and transforms the data into a numeric array.

python

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Sample categorical data with two features
X = np.array([['red', 'S'], ['green', 'M'], ['blue', 'L'], ['green', 'XL']])

# Create encoder with sparse_output=False to get dense array output
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Fit and transform the data
X_encoded = encoder.fit_transform(X)

print('Encoded array:')
print(X_encoded)

print('\nFeature categories:')
print(encoder.categories_)

Output

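Encoded array:
[[0. 0. 1. 0. 0. 1. 0.]
 [0. 1. 0. 0. 1. 0. 0.]
 [1. 0. 0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 1.]]

Feature categories:
[array(['blue', 'green', 'red'], dtype='<U5'), array(['L', 'M', 'S', 'XL'], dtype='<U5')]

The encoded array has seven columns: three for the colors followed by four for the sizes, each group in sorted order.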

⚠️

Common Pitfalls

Common mistakes when using OneHotEncoder include:

Not setting handle_unknown='ignore' when new data may contain categories that were not seen during fit; with the default 'error', transform() raises a ValueError.
Forgetting to set sparse_output=False if you want a dense NumPy array; by default the encoder returns a SciPy sparse matrix.
Passing 1D arrays instead of 2D arrays; OneHotEncoder expects 2D input.

python

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Wrong: 1D input array
X_wrong = np.array(['red', 'green', 'blue'])
encoder = OneHotEncoder(sparse_output=False)

# This raises ValueError (expected 2D array, got 1D array):
# encoder.fit_transform(X_wrong)

# Correct: reshape to 2D
X_correct = X_wrong.reshape(-1, 1)
encoded = encoder.fit_transform(X_correct)
print(encoded)

Output

[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
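The first two pitfalls can be reproduced with an encoder left at its defaults. This is a small sketch on made-up data; the variable names and the unseen value 'purple' are illustrative only.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

X_train = np.array([['red'], ['green'], ['blue']])

# Defaults: sparse output and handle_unknown='error'
default_encoder = OneHotEncoder()
encoded_sparse = default_encoder.fit_transform(X_train)
print(sparse.issparse(encoded_sparse))  # the result is a SciPy sparse matrix

# Convert explicitly when a dense NumPy array is needed
encoded_dense = encoded_sparse.toarray()
print(encoded_dense)

# An unseen category raises ValueError under the default handle_unknown='error'
try:
    default_encoder.transform(np.array([['purple']]))
    unknown_raised = False
except ValueError as exc:
    unknown_raised = True
    print(f"ValueError: {exc}")
```

Calling toarray() is a reasonable alternative to sparse_output=False when only a small part of the pipeline needs dense data.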

📊

Quick Reference

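Parameter	Description	Default
handle_unknown	How to handle categories not seen during fit: 'error' raises a ValueError, 'ignore' encodes them as all zeros (newer releases add further options such as 'infrequent_if_exist')	'error'
sparse_output	Return a SciPy sparse matrix if True, else a dense NumPy array (named sparse before scikit-learn 1.2)	True
categories	'auto' to learn categories from the data, or a list with one list of categories per feature	'auto'
drop	Drop one category per feature to avoid multicollinearity, e.g. 'first' or 'if_binary'	None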

✅

Key Takeaways

Use OneHotEncoder to convert categorical features into numeric arrays for ML models.

Always provide 2D input arrays to OneHotEncoder, even for single features.

Set handle_unknown='ignore' to encode categories unseen during fit as all zeros instead of raising an error during transform.

Set sparse_output=False if you want a dense NumPy array instead of a sparse matrix.

Use fit_transform() to fit and encode data in one step for convenience.