How to Use OneHotEncoder in sklearn with Python

Use OneHotEncoder from sklearn.preprocessing to convert categorical features into one-hot numeric arrays. Call fit() to learn the categories and transform() to encode the data, or call fit_transform() to do both in one step.

Syntax

The basic syntax to use OneHotEncoder is:

  • OneHotEncoder(): Creates the encoder object.
  • fit(X): Learns the categories from data X.
  • transform(X): Converts data X into one-hot encoded format.
  • fit_transform(X): Combines fit and transform in one step.

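You can customize behavior with parameters such as handle_unknown, which controls what happens when transform() meets a category that was not seen during fit, and sparse_output, which selects a sparse matrix or a dense array as the output. The sparse_output name requires scikit-learn 1.2 or later; older releases call this parameter sparse.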

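python
from sklearn.preprocessing import OneHotEncoder

# X must be 2D: one row per sample, one column per categorical feature
X = [['red'], ['green'], ['blue']]

encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoder.fit(X)                    # learn the categories in each column
X_encoded = encoder.transform(X)  # encode using the learned categories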

Example

This example shows how to encode a small dataset with two categorical features using OneHotEncoder. It fits the encoder and transforms the data into a numeric array.

python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Sample categorical data with two features
X = np.array([['red', 'S'], ['green', 'M'], ['blue', 'L'], ['green', 'XL']])

# Create encoder with sparse_output=False to get dense array output
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Fit and transform the data
X_encoded = encoder.fit_transform(X)

print('Encoded array:')
print(X_encoded)

print('\nFeature categories:')
print(encoder.categories_)
Output
Encoded array:
[[0. 0. 1. 0. 0. 1. 0.]
 [0. 1. 0. 0. 1. 0. 0.]
 [1. 0. 0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 1.]]

Feature categories:
[array(['blue', 'green', 'red'], dtype='<U5'), array(['L', 'M', 'S', 'XL'], dtype='<U5')]

Common Pitfalls

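Common mistakes when using OneHotEncoder include:

  • Leaving handle_unknown at its default of 'error' when new data may contain categories that were not seen during fit; transform() then raises a ValueError. Set handle_unknown='ignore' to encode such categories as all zeros instead.
  • Forgetting to set sparse_output=False when you want a dense numpy array; the default output is a sparse matrix.
  • Passing 1D arrays instead of 2D arrays; OneHotEncoder expects 2D input, so reshape a single feature with reshape(-1, 1).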
python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Wrong: 1D input array
X_wrong = np.array(['red', 'green', 'blue'])
encoder = OneHotEncoder(sparse_output=False)

# This raises a ValueError because the input is 1D:
# encoder.fit_transform(X_wrong)

# Correct: reshape to 2D
X_correct = X_wrong.reshape(-1, 1)
encoded = encoder.fit_transform(X_correct)
print(encoded)
Output
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]

Quick Reference

Parameter | Description | Default
handle_unknown | How to handle categories not seen during fit (options include 'error' and 'ignore') | 'error'
sparse_output | Return a sparse matrix if True, else a dense array | True
categories | Specify the categories manually as a list per feature, or 'auto' to learn them from the data | 'auto'
drop | Drop one category per feature to avoid multicollinearity (for example 'first' or 'if_binary') | None

Key Takeaways

  • Use OneHotEncoder to convert categorical features into numeric arrays for ML models.
  • Always provide 2D input arrays to OneHotEncoder, even for single features.
  • Set handle_unknown='ignore' to avoid errors with unseen categories during transform.
  • Set sparse_output=False if you want a dense numpy array output instead of a sparse matrix.
  • Use fit_transform() to fit and encode data in one step for convenience.