How to Use One Hot Encoding with sklearn in Python

Use OneHotEncoder from sklearn.preprocessing to convert categorical features into a one hot numeric array. Fit the encoder on your data with fit or fit_transform, then transform your data to get one hot encoded output.

📐

Syntax

The main class for one hot encoding in sklearn is OneHotEncoder. You create an encoder object, then use fit or fit_transform on your categorical data. Use transform to encode new data after fitting.

OneHotEncoder(): creates the encoder. Pass sparse_output=False to get a dense array (this parameter is available in scikit-learn 1.2 and later; older releases called it sparse).
fit(X): learns categories from data X.
transform(X): converts data X to one hot encoded format.
fit_transform(X): fits and transforms in one step.

python

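from sklearn.preprocessing import OneHotEncoder

# X must be a 2D array-like of categorical data (placeholder values)
X = [['red'], ['green'], ['blue']]

encoder = OneHotEncoder(sparse_output=False)
encoder.fit(X)                  # learn the categories
encoded = encoder.transform(X)  # encode using the learned categories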

💻

Example

This example shows how to one hot encode a list of color names using sklearn's OneHotEncoder. It fits the encoder and transforms the data into a numeric array where each column represents a color category.

python

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Sample categorical data (2D array)
colors = np.array([['red'], ['green'], ['blue'], ['green'], ['red']])

# Create encoder with dense output
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the data
encoded_colors = encoder.fit_transform(colors)

print("Categories:", encoder.categories_)
print("One hot encoded array:\n", encoded_colors)

Output

Categories: [array(['blue', 'green', 'red'], dtype='<U5')]
One hot encoded array:
 [[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

⚠️

Common Pitfalls

Passing a 1D array raises an error because OneHotEncoder expects 2D input; reshape it with reshape(-1, 1) first.
The default, sparse_output=True, returns a SciPy sparse matrix, which can be confusing if you expect a dense NumPy array.
Fitting a new encoder on the test data instead of reusing the one fitted on the training data can produce different columns for train and test; fit once on the training data and call transform on both.
Passing categories to transform that were not seen during fit raises an error unless handle_unknown='ignore' is set, in which case the unseen category is encoded as all zeros.

python

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Wrong: 1D array input
colors_wrong = np.array(['red', 'green', 'blue'])
encoder = OneHotEncoder(sparse_output=False)

# This will raise an error:
# encoder.fit_transform(colors_wrong)  # ValueError

# Correct: reshape to 2D
colors_correct = colors_wrong.reshape(-1, 1)
encoded = encoder.fit_transform(colors_correct)
print(encoded)

Output

[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]

📊

Quick Reference

Remember these key points when using OneHotEncoder:

Input must be 2D array: shape (n_samples, n_features).
Use sparse_output=False for dense numpy arrays.
Fit on training data, then transform train and test data.
Set handle_unknown='ignore' to avoid errors on unseen categories.

✅

Key Takeaways

Use sklearn's OneHotEncoder to convert categorical data into numeric one hot format.

Always provide 2D input arrays to OneHotEncoder, reshaping 1D arrays if needed.

Set sparse_output=False to get a dense numpy array output instead of a sparse matrix.

Fit the encoder on training data and use the same encoder to transform test data.

Handle unknown categories with handle_unknown='ignore' to avoid errors on new data.