
Handling categorical variables in ML Python


Introduction

Categorical variables hold labels such as words or names, not numeric quantities. Most machine learning models work only with numbers, so these labels must be converted into numbers before a model can learn from them.

You have data with categories like colors, countries, or types of animals.

You want to use machine learning models that only work with numbers.

You want to improve your model by correctly representing categories.

You need to prepare data for algorithms like decision trees or neural networks.

You want to avoid mistakes caused by treating categories as numbers directly.

Syntax

ML Python

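from sklearn.preprocessing import OneHotEncoder

# Return a regular NumPy array instead of a sparse matrix
encoder = OneHotEncoder(sparse_output=False)
# Input must be 2D: one row per sample, one column per feature
categorical_data = [['red'], ['green'], ['blue']]
# Learn the categories and encode the data in one step
encoded_data = encoder.fit_transform(categorical_data)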

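OneHotEncoder turns each category into a new column that holds 0 or 1. The sparse_output parameter requires scikit-learn 1.2 or later; older versions call it sparse.

Use fit_transform on training data and transform on new data, so that both are encoded with the same columns.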

Examples

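LabelEncoder changes categories into integers like 0, 1, 2. It is intended for target labels, and it numbers the categories in sorted order, so the integers do not reflect any real ranking.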

ML Python

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# LabelEncoder takes a 1D list of labels
categories = ['cat', 'dog', 'fish']
# Result: array([0, 1, 2])
encoded = le.fit_transform(categories)
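For a feature whose categories have a real order, scikit-learn's OrdinalEncoder accepts that order explicitly. The sketch below contrasts it with LabelEncoder; the size values are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordered feature
sizes = ['small', 'large', 'medium', 'small']

# LabelEncoder assigns integers in sorted (alphabetical) order
le = LabelEncoder()
label_encoded = le.fit_transform(sizes)
print(le.classes_)
print(label_encoded)

# OrdinalEncoder lets us state the order explicitly: small < medium < large
oe = OrdinalEncoder(categories=[['small', 'medium', 'large']])
# OrdinalEncoder expects 2D input, so wrap each value in its own row
ordinal_encoded = oe.fit_transform([[s] for s in sizes])
print(ordinal_encoded.ravel())
```

With LabelEncoder, 'large' receives the smallest integer simply because it sorts first; with the explicit category list, the integers follow the intended ranking.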

OneHotEncoder creates a separate 0/1 column for each category.

ML Python

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)
# 2D input: each inner list is one sample
categories = [['cat'], ['dog'], ['fish']]
# Result has 3 rows and 3 columns, one column per animal
encoded = encoder.fit_transform(categories)

Pandas get_dummies quickly converts categories into one-hot columns.

ML Python

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue']})
# One column per color: blue, green, red (True/False in pandas 2.0+, 0/1 in older versions)
dummies = pd.get_dummies(df['color'])
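Because get_dummies builds its columns from whatever values it sees, training data and new data can end up with different columns. One way to line them up, sketched here with the made-up frames train_df and new_df, is to reindex the new dummies to the training columns.

```python
import pandas as pd

# Hypothetical training and new data
train_df = pd.DataFrame({'color': ['red', 'green', 'blue']})
new_df = pd.DataFrame({'color': ['green', 'green']})

# dtype=int gives 0/1 columns (recent pandas versions default to True/False)
train_dummies = pd.get_dummies(train_df['color'], dtype=int)
new_dummies = pd.get_dummies(new_df['color'], dtype=int)

# Add the columns missing from the new data, filled with 0, in the training order
new_aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(train_dummies)
print(new_aligned)
```

Reindexing also drops any column that appears only in the new data, so unseen categories end up as all-zero rows.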

Sample Program

This program changes color names into numbers using one-hot encoding. Each sample becomes a row with a 1 in the column for its color and 0 elsewhere.

ML Python

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Sample categorical data
colors = np.array([['red'], ['green'], ['blue'], ['green'], ['red']])

# Create encoder
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform data
encoded_colors = encoder.fit_transform(colors)

print('Original data:')
print(colors.flatten())
print('\nEncoded data:')
print(encoded_colors)

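Output

Original data:
['red' 'green' 'blue' 'green' 'red']

Encoded data:
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

The columns follow the sorted category order: blue, green, red.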

Important Notes

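Most machine learning models cannot use raw string categories, so encode them first.

One-hot encoding adds one column per category, so a feature with many distinct values can make the data very wide.

Encoding unordered categories as integers can mislead models, which may treat 2 as greater than 1 even though the categories have no order.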
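The growth in size is easy to see with a column that has many distinct values. This sketch uses made-up category names and keeps the encoder's default sparse output, which stores only the non-zero entries.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column with 1000 distinct categories, one row each
ids = np.array([[f'item_{i}'] for i in range(1000)])

# The default output is a SciPy sparse matrix
sparse_encoder = OneHotEncoder()
encoded_sparse = sparse_encoder.fit_transform(ids)

print(encoded_sparse.shape)  # one column per distinct category
print(encoded_sparse.nnz)    # number of stored non-zero values

# A dense array would hold every zero explicitly
dense_cells = encoded_sparse.shape[0] * encoded_sparse.shape[1]
print(dense_cells)
```

Here the sparse matrix stores 1000 values, while the equivalent dense array would hold one million cells, almost all of them zero.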

Summary

Categorical variables need to be converted to numbers for machine learning.

Use ordinal (integer) encoding with an explicit order for ordered categories and one-hot encoding for unordered ones.

Tools like scikit-learn and pandas make encoding easy and reliable.