How to Use OneHotEncoder in sklearn with Python
Use OneHotEncoder from sklearn.preprocessing to convert categorical features into one-hot numeric arrays. Fit the encoder on your data with fit(), then call transform() to get the encoded output, or do both in one step with fit_transform().
Syntax
The basic syntax to use OneHotEncoder is:
- OneHotEncoder(): Creates the encoder object.
- fit(X): Learns the categories from data X.
- transform(X): Converts data X into one-hot encoded format.
- fit_transform(X): Combines fit and transform in one step.
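You can customize the behavior with parameters such as handle_unknown, which controls how categories not seen during fit are handled, and sparse_output, which selects a sparse or dense output format. The sparse_output parameter was introduced in scikit-learn 1.2; earlier releases call it sparse.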
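```python
from sklearn.preprocessing import OneHotEncoder

# X is your categorical data, shaped (n_samples, n_features)
X = [['red'], ['green'], ['blue']]

encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoder.fit(X)                    # learn the categories
X_encoded = encoder.transform(X)  # encode using the learned categories
```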
Example
This example shows how to encode a small dataset with two categorical features using OneHotEncoder. It fits the encoder and transforms the data into a numeric array.
```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Sample categorical data with two features
X = np.array([['red', 'S'], ['green', 'M'], ['blue', 'L'], ['green', 'XL']])

# Create encoder with sparse_output=False to get dense array output
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Fit and transform the data
X_encoded = encoder.fit_transform(X)

print('Encoded array:')
print(X_encoded)
print('\nFeature categories:')
print(encoder.categories_)
```
Output
Encoded array:
[[0. 0. 1. 0. 0. 1. 0.]
 [0. 1. 0. 0. 1. 0. 0.]
 [1. 0. 0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 1.]]
Feature categories:
[array(['blue', 'green', 'red'], dtype='<U5'), array(['L', 'M', 'S', 'XL'], dtype='<U5')]
Common Pitfalls
Common mistakes when using OneHotEncoder include:
- Not setting handle_unknown='ignore' when transforming new data with unseen categories, which causes errors.
- Forgetting to set sparse_output=False if you want a dense numpy array instead of a sparse matrix.
- Passing 1D arrays instead of 2D arrays; OneHotEncoder expects 2D input.
```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Wrong: 1D input array
X_wrong = np.array(['red', 'green', 'blue'])

encoder = OneHotEncoder(sparse_output=False)

# This will raise an error:
# encoder.fit_transform(X_wrong)

# Correct: reshape to 2D (one column, one row per sample)
X_correct = X_wrong.reshape(-1, 1)
encoded = encoder.fit_transform(X_correct)
print(encoded)
```
Output
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
Quick Reference
| Parameter | Description | Default |
|---|---|---|
| handle_unknown | How to handle categories not seen during fit (options include 'error', 'ignore', and 'infrequent_if_exist') | 'error' |
| sparse_output | Return sparse matrix if True, else dense array | True |
| categories | Specify categories manually or 'auto' to learn from data | 'auto' |
| drop | Drop one category per feature ('first', 'if_binary', or an explicit array) to avoid multicollinearity | None |
Key Takeaways
Use OneHotEncoder to convert categorical features into numeric arrays for ML models.
Always provide 2D input arrays to OneHotEncoder, even for single features.
Set handle_unknown='ignore' to avoid errors with unseen categories during transform.
Set sparse_output=False if you want a dense numpy array output instead of a sparse matrix.
Use fit_transform() to fit and encode data in one step for convenience.