
One-hot encoding in ML Python - ML Experiment: Train & Evaluate

Experiment - One-hot encoding
Problem: You have a dataset with a categorical feature 'Color' having values like 'Red', 'Green', and 'Blue'. You want to convert this feature into a format that a machine learning model can understand.
Current Metrics: The model trained on raw categorical data without encoding achieves 60% accuracy on validation data.
Issue: The model cannot interpret categorical text data directly, leading to poor performance.
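To make the target format concrete, here is a minimal sketch of what one-hot encoding produces for the three 'Color' values named in the problem. The variable names `colors` and `one_hot` are illustrative only, and `dtype=int` is an assumption made here so the columns print as 0/1 rather than booleans.

```python
import pandas as pd

# Toy column holding the three categories from the problem statement.
colors = pd.Series(['Red', 'Green', 'Blue'], name='Color')

# One binary column per category; dtype=int yields 0/1 instead of True/False.
one_hot = pd.get_dummies(colors, prefix='Color', dtype=int)

print(one_hot)
#    Color_Blue  Color_Green  Color_Red
# 0           0            0          1
# 1           0            1          0
# 2           1            0          0
```

Each row contains exactly one 1, in the column that matches its original category, so no ordering between colors is implied.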
Your Task
Apply one-hot encoding to the 'Color' feature to improve model accuracy to at least 75%.
Use one-hot encoding only on the 'Color' feature.
Keep the rest of the dataset and model architecture unchanged.
Solution
ML Python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample dataset
data = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue', 'Green', 'Red', 'Blue', 'Red', 'Green'],
    'Size': [1, 2, 3, 2, 1, 3, 1, 2],
    'Label': [0, 1, 0, 1, 0, 0, 0, 1]
})

# One-hot encode the 'Color' feature
color_encoded = pd.get_dummies(data['Color'], prefix='Color')

# Replace 'Color' column with encoded columns
data_encoded = pd.concat([data.drop('Color', axis=1), color_encoded], axis=1)

# Split features and target
X = data_encoded.drop('Label', axis=1)
y = data_encoded['Label']

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)

print(f"Validation Accuracy after one-hot encoding: {accuracy * 100:.2f}%")
Applied one-hot encoding to the 'Color' categorical feature using pandas get_dummies.
Replaced the original 'Color' column with the new binary columns representing each category.
Trained the same logistic regression model on the encoded data.
Results Interpretation

Before one-hot encoding, the model accuracy was 60%. After encoding, accuracy improved to 100% on validation data.

This shows the model better understands categorical data when it is converted into a numeric format it can process.

One-hot encoding transforms categorical text data into a numeric format that machine learning models can use effectively, often improving model accuracy.
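When reading the accuracy figures, it may help to check how many rows the validation split actually holds. The sketch below assumes the same 8-row sample and the same `test_size=0.25` as the solution; `rows`, `n_val`, and `possible_accuracies` are illustrative names.

```python
from sklearn.model_selection import train_test_split

# Stand-in for the 8 rows of the sample dataset.
rows = list(range(8))
train_rows, val_rows = train_test_split(rows, test_size=0.25, random_state=42)

n_val = len(val_rows)
# With n_val validation rows, accuracy can only take n_val + 1 distinct values.
possible_accuracies = [100 * correct / n_val for correct in range(n_val + 1)]

print(n_val)                # 2
print(possible_accuracies)  # [0.0, 50.0, 100.0]
```

With only two validation rows, a single prediction shifts accuracy by 50 percentage points, so scores on this sample are best read as an illustration of the workflow rather than a reliable measure of improvement.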
Bonus Experiment
Try using label encoding instead of one-hot encoding on the 'Color' feature and compare the model accuracy.
💡 Hint
Label encoding assigns a unique integer to each category but may introduce an unintended order between categories. Observe how this affects model performance.
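As a starting point for the bonus experiment, the encoding step alone might look like the sketch below. It uses scikit-learn's `LabelEncoder`, which the library documents for target labels (`OrdinalEncoder` is its counterpart for feature columns). It is used here only to show the integer mapping, and the names `color_codes` and `mapping` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# 'Color' values from the sample dataset.
colors = pd.Series(['Red', 'Green', 'Blue', 'Green', 'Red', 'Blue', 'Red', 'Green'])

encoder = LabelEncoder()
color_codes = encoder.fit_transform(colors)

# Categories are sorted alphabetically before integers are assigned.
mapping = {category: code for code, category in enumerate(encoder.classes_)}

print(mapping)               # {'Blue': 0, 'Green': 1, 'Red': 2}
print(color_codes.tolist())  # [2, 1, 0, 1, 2, 0, 2, 1]
```

The integers imply Blue < Green < Red, an order with no meaning for colors. A linear model such as logistic regression treats that single column as a numeric scale, which is the effect the hint asks you to observe.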