
Handling categorical variables in ML Python

Introduction
Categorical variables hold words or labels rather than numeric quantities. Most machine learning algorithms operate on numbers, so these labels must be encoded numerically before a model can learn from them. Typical situations:
You have data with categories like colors, countries, or types of animals.
You want to use machine learning models that only work with numbers.
You want to improve your model by correctly representing categories.
You need to prepare data for algorithms like decision trees or neural networks.
You want to avoid mistakes caused by treating categories as numbers directly.
Syntax
ML Python
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
categorical_data = [['red'], ['green'], ['blue']]
encoded_data = encoder.fit_transform(categorical_data)
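OneHotEncoder turns each category into its own column holding 0 or 1. The sparse_output parameter requires scikit-learn 1.2 or later; earlier versions call it sparse.
Use fit_transform on training data and transform on new data, so that new data is encoded with the same columns learned from the training set.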
Examples
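LabelEncoder changes categories into integers like 0, 1, 2, assigned in sorted (alphabetical) order of the labels. It is designed for target labels (y); for ordered input features, OrdinalEncoder with an explicit category order is the better fit.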
ML Python
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
categories = ['cat', 'dog', 'fish']
encoded = le.fit_transform(categories)
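LabelEncoder numbers labels in sorted order, so it cannot express a custom ranking such as small < medium < large. A minimal sketch of one alternative, assuming a hypothetical ordered feature of T-shirt sizes and using scikit-learn's OrdinalEncoder with the order passed explicitly:

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered feature: T-shirt sizes (illustrative values only)
sizes = [['medium'], ['small'], ['large'], ['small']]

# Pass the order explicitly; otherwise categories are sorted alphabetically
ordinal = OrdinalEncoder(categories=[['small', 'medium', 'large']])
encoded_sizes = ordinal.fit_transform(sizes)

print(encoded_sizes.ravel())  # small -> 0, medium -> 1, large -> 2
```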
OneHotEncoder creates separate columns for each category with 0 or 1.
ML Python
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
categories = [['cat'], ['dog'], ['fish']]
encoded = encoder.fit_transform(categories)
Pandas get_dummies quickly converts categories into one-hot columns.
ML Python
import pandas as pd
df = pd.DataFrame({'color': ['red', 'green', 'blue']})
dummies = pd.get_dummies(df['color'])
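Unlike a fitted encoder, get_dummies does not remember which categories it saw, so new data can produce different columns. One possible way to keep the columns aligned is to reindex against the training columns. The sketch below assumes two small made-up DataFrames, train_df and new_df:

```python
import pandas as pd

# Hypothetical training and new data (illustrative values only)
train_df = pd.DataFrame({'color': ['red', 'green', 'blue']})
new_df = pd.DataFrame({'color': ['green', 'green']})

train_dummies = pd.get_dummies(train_df['color'], dtype=int)

# Reindex so the new data has the same columns as the training data,
# filling categories that are absent from new_df with 0
new_dummies = pd.get_dummies(new_df['color'], dtype=int).reindex(
    columns=train_dummies.columns, fill_value=0
)
print(new_dummies)
```

Categories that appear only in the new data are dropped by this reindex, which mirrors ignoring unknown categories.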
Sample Program
This program changes color names into numbers using one-hot encoding. Each sample becomes a row with 1 in the column for its color and 0 elsewhere. The columns follow the sorted category order: blue, green, red.
ML Python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Sample categorical data
colors = np.array([['red'], ['green'], ['blue'], ['green'], ['red']])

# Create encoder
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform data
encoded_colors = encoder.fit_transform(colors)

print('Original data:')
print(colors.flatten())
print('\nEncoded data:')
print(encoded_colors)
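Output
Original data:
['red' 'green' 'blue' 'green' 'red']

Encoded data:
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]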
Important Notes
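Most machine learning models cannot use raw string categories, so encode them first.
One-hot encoding adds one column per distinct category, so features with many categories greatly increase data size.
Integer encoding can mislead models when categories have no natural order, because the model may treat 2 as "greater than" 0.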
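The growth in size can be seen with a quick experiment. This sketch assumes a hypothetical feature with 1,000 distinct IDs and leaves OneHotEncoder at its default sparse output, which stores only the non-zero entries and so softens the memory cost:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical high-cardinality feature: 1,000 distinct IDs (illustrative)
ids = np.array([[f'id_{i}'] for i in range(1000)])

# The default output is a SciPy sparse matrix that stores only non-zeros
sparse_encoded = OneHotEncoder().fit_transform(ids)

print(sparse_encoded.shape)  # one column per distinct category
print(sparse_encoded.nnz)    # number of stored non-zero entries
```

A dense version of the same result would hold a million values, nearly all of them zero.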
Summary
Categorical variables need to be converted to numbers for machine learning.
Use ordinal encoding (integers that follow the category order) for ordered categories and one-hot encoding for unordered ones.
Tools like scikit-learn and pandas make encoding easy and reliable.