
One-hot encoding for text in NLP

Introduction

One-hot encoding turns each word into a vector of zeros and ones, with a single 1 marking which word is present. This gives a machine learning model a simple numeric view of which words appear in text.

When you want to convert words into numbers for a machine learning model.
When you have a small set of words and want a simple way to represent them.
When you need to prepare text data for basic classification tasks.
When you want to check if certain words appear in a sentence.
When you want to compare texts by their word presence.
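The idea behind these use cases can be sketched in plain Python, with a small hypothetical vocabulary, before turning to scikit-learn:

```python
# Minimal sketch of one-hot encoding without any library.
vocab = sorted({'cat', 'dog', 'bird'})               # unique words: ['bird', 'cat', 'dog']
index = {word: i for i, word in enumerate(vocab)}    # word -> position

def one_hot(word):
    # Vector of zeros with a single 1 at the word's position
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

print(one_hot('cat'))   # [0, 1, 0]
```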
Syntax
Python
from sklearn.preprocessing import OneHotEncoder

# Create encoder
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform text data
encoded = encoder.fit_transform(text_data)

Input text data must be reshaped to a 2D array, e.g. [['word1'], ['word2'], ...]

Output is a 2D array where each row is a one-hot vector for a word.
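As a sketch of that reshape step, a flat list of words can be turned into the expected column shape with NumPy (assuming NumPy is installed, which scikit-learn already requires):

```python
import numpy as np

words = ['cat', 'dog', 'bird']           # flat 1D list of words
column = np.array(words).reshape(-1, 1)  # 2D column: one word per row
print(column.shape)                      # (3, 1)
```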

Examples
This example encodes a list of words. Each unique word gets a position with 1, others 0.
Python
from sklearn.preprocessing import OneHotEncoder

words = [['cat'], ['dog'], ['cat'], ['bird']]
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(words)
print(encoded)
# [[0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]
This example shows the unique words (categories) the encoder learned during fitting.
Python
from sklearn.preprocessing import OneHotEncoder

words = [['apple'], ['banana'], ['apple'], ['orange']]
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(words)
print(encoder.categories_)
# Categories are sorted alphabetically: apple, banana, orange
Sample Model

This program shows how to convert a list of words into one-hot vectors. It prints the unique words and their encoded forms.

Python
from sklearn.preprocessing import OneHotEncoder

# List of words to encode
words = [['hello'], ['world'], ['hello'], ['machine'], ['learning']]

# Create the encoder
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the words
encoded_words = encoder.fit_transform(words)

# Print the unique words found
print('Unique words:', encoder.categories_)

# Print the one-hot encoded vectors
print('One-hot encoded vectors:')
for word, vector in zip(words, encoded_words):
    print(f'{word[0]}: {vector}')
Important Notes

One-hot encoding creates very long, sparse vectors when you have many unique words.

For large text data, other methods like word embeddings are more efficient.

Always reshape your text data to 2D before using OneHotEncoder.

Summary

One-hot encoding turns words into simple vectors with a single 1 and the rest 0s.

It helps machine learning models understand which words appear.

Best for small sets of words or simple text tasks.