NLP · ML · ~15 mins

One-hot encoding for text in NLP - ML Experiment: Train & Evaluate

Experiment - One-hot encoding for text
Problem: You want to convert a list of text sentences into one-hot encoded vectors to prepare the data for a machine learning model.
Current Metrics: N/A. The text is still raw and unencoded, so no model can be trained yet.
Issue: The text data is not in a numeric format that machine learning models can consume. Without encoding, a model cannot learn from the text.
Your Task
Convert the given list of sentences into one-hot encoded vectors using a vocabulary built from the text. Verify the encoding by printing the one-hot vectors.
Use only Python standard libraries and scikit-learn.
Do not use embedding layers or other complex encodings.
Keep the vocabulary size manageable by using the unique words from the input sentences.
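Before reaching for scikit-learn, it can help to see what one-hot encoding actually does using only the standard library. The sketch below is a minimal illustration, not the expected solution; tokenizing by lowercased whitespace split is an assumption (scikit-learn's default tokenizer behaves slightly differently):

```python
# Minimal one-hot encoding with only the Python standard library.
# Tokenization: lowercased whitespace split (an assumption for this sketch).
sentences = [
    "I love machine learning",
    "Machine learning is fun",
    "I love coding",
    "Coding is fun",
]

# Build a sorted vocabulary of unique lowercased words
vocab = sorted({word for s in sentences for word in s.lower().split()})

def one_hot(sentence):
    # Mark 1 for each vocabulary word present in the sentence, else 0
    words = set(sentence.lower().split())
    return [1 if w in words else 0 for w in vocab]

print("Vocabulary:", vocab)
for s in sentences:
    print(s, "->", one_hot(s))
```

Each vector has one slot per vocabulary word, so every sentence maps to a fixed-length list of 0s and 1s regardless of its length.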
Solution
from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
sentences = [
    "I love machine learning",
    "Machine learning is fun",
    "I love coding",
    "Coding is fun"
]

# Create CountVectorizer with binary=True for one-hot encoding
vectorizer = CountVectorizer(binary=True)

# Fit on sentences to build vocabulary
vectorizer.fit(sentences)

# Transform sentences to one-hot encoded vectors
one_hot_vectors = vectorizer.transform(sentences)

# Convert sparse matrix to array for display
one_hot_array = one_hot_vectors.toarray()

# Print vocabulary and one-hot vectors
print("Vocabulary:", vectorizer.get_feature_names_out())
print("One-hot encoded vectors:")
for sentence, vector in zip(sentences, one_hot_array):
    print(f"{sentence}: {vector}")
Used CountVectorizer with binary=True to create one-hot encoding.
Built vocabulary from the input sentences.
Transformed sentences into one-hot encoded vectors.
Printed vocabulary and vectors to verify encoding.
Results Interpretation

Before encoding, the text was raw and unusable by ML models.

After encoding, each sentence is represented as a vector of 0s and 1s indicating presence of words.

This numeric format can now be fed into ML models.

One-hot encoding converts text into a simple numeric format by marking which words appear in each sentence. This is a basic but important step to prepare text data for machine learning.
Bonus Experiment
Try using TF-IDF encoding instead of one-hot encoding to see how word importance affects the vectors.
💡 Hint
Use sklearn's TfidfVectorizer and compare the output vectors to one-hot vectors.