How to Use One Hot Encoding with sklearn in Python
Use OneHotEncoder from sklearn.preprocessing to convert categorical features into a one hot numeric array. Fit the encoder on your data with fit or fit_transform, then transform your data to get one hot encoded output.
Syntax
The main class for one hot encoding in sklearn is OneHotEncoder. You create an encoder object, then use fit or fit_transform on your categorical data. Use transform to encode new data after fitting.
- OneHotEncoder(): creates the encoder with options like sparse_output=False to get a dense array output.
- fit(X): learns categories from data X.
- transform(X): converts data X to one hot encoded format.
- fit_transform(X): fits and transforms in one step.
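```python
from sklearn.preprocessing import OneHotEncoder

# X is a 2D array of categorical data, shape (n_samples, n_features).
# Placeholder values for illustration only.
X = [['red'], ['green'], ['blue']]

encoder = OneHotEncoder(sparse_output=False)
encoder.fit(X)                  # learn the categories
encoded = encoder.transform(X)  # convert to one hot encoded format
```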
Example
This example shows how to one hot encode a list of color names using sklearn's OneHotEncoder. It fits the encoder and transforms the data into a numeric array where each column represents a color category.
```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Sample categorical data (2D array)
colors = np.array([['red'], ['green'], ['blue'], ['green'], ['red']])

# Create encoder with dense output
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the data
encoded_colors = encoder.fit_transform(colors)

print("Categories:", encoder.categories_)
print("One hot encoded array:\n", encoded_colors)
```
Output
Categories: [array(['blue', 'green', 'red'], dtype='<U5')]
One hot encoded array:
 [[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
Common Pitfalls
- Not reshaping 1D input arrays to 2D before encoding causes errors; sklearn expects 2D arrays.
- The default sparse_output=True returns a sparse matrix, which can be confusing if you expect a dense array.
- Fitting a separate encoder on test data, instead of transforming it with the encoder fitted on the training data, leads to inconsistent categories and column order.
- Passing unseen categories in new data to transform causes errors unless handle_unknown='ignore' is set.
```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Wrong: 1D array input
colors_wrong = np.array(['red', 'green', 'blue'])

encoder = OneHotEncoder(sparse_output=False)

# This will raise an error:
# encoder.fit_transform(colors_wrong)  # ValueError

# Correct: reshape to 2D
colors_correct = colors_wrong.reshape(-1, 1)
encoded = encoder.fit_transform(colors_correct)
print(encoded)
```
Output
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
Quick Reference
Remember these key points when using OneHotEncoder:
- Input must be 2D array: shape (n_samples, n_features).
- Use sparse_output=False for dense numpy arrays.
- Fit on training data, then transform train and test data.
- Set handle_unknown='ignore' to avoid errors on unseen categories.
Key Takeaways
Use sklearn's OneHotEncoder to convert categorical data into numeric one hot format.
Always provide 2D input arrays to OneHotEncoder, reshaping 1D arrays if needed.
Set sparse_output=False to get a dense numpy array output instead of a sparse matrix.
Fit the encoder on training data and use the same encoder to transform test data.
Handle unknown categories with handle_unknown='ignore' to avoid errors on new data.