How to Use get_dummies in pandas for Python Data Processing

Use pandas.get_dummies() to convert categorical columns into one-hot encoded indicator columns. The function creates one new column per category and marks which rows belong to it (True/False by default in pandas 2.0 and later, or 1/0 with dtype=int), which makes the data ready for machine learning models.

Syntax

The full signature of pandas.get_dummies(), with default values, is:

python
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)

The most commonly used parameters are:

  • data: The input data (DataFrame, Series, or array-like) containing categorical variables.
  • prefix: Optional string, list, or dict to prepend to new column names.
  • columns: List of columns to encode; if None, all columns with object, string, or category dtype are encoded.
  • drop_first: Boolean; if True, drops the first category of each encoded column to avoid multicollinearity (default is False).
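
As a quick illustration of how prefix, columns, and drop_first interact, here is a minimal sketch. The DataFrame contents are made up for this example, and dtype=int is an assumption added only to get 0/1 values instead of booleans on pandas 2.0 and later.

```python
import pandas as pd

# Hypothetical sample data, not from the article
df = pd.DataFrame({'size': ['S', 'M', 'L', 'M'], 'price': [10, 12, 15, 12]})

# Encode only 'size', name the new columns 'sz_<category>',
# and drop the first category in sorted order ('L') as the baseline
encoded = pd.get_dummies(
    df, columns=['size'], prefix='sz', drop_first=True, dtype=int
)

print(encoded.columns.tolist())
# ['price', 'sz_M', 'sz_S']
```

A row with 0 in both sz_M and sz_S stands for the dropped category 'L', so no information is lost.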

Example

This example shows how to convert a DataFrame with a categorical column into one-hot encoded columns using get_dummies. It demonstrates how each category becomes a new column with 1 or 0 values.

python
import pandas as pd

# Sample data with a categorical column
data = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# Convert categorical 'Color' column to dummy variables
# dtype=int gives 1/0 values; pandas 2.0+ returns True/False by default
encoded_data = pd.get_dummies(data, columns=['Color'], dtype=int)

print(encoded_data)
Output
   Color_Blue  Color_Green  Color_Red
0           0            0          1
1           1            0          0
2           0            1          0
3           1            0          0
4           0            0          1
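
The columns argument matters more once a DataFrame has several text columns. The sketch below compares the two calls on hypothetical data (the 'Code' and 'Price' columns are assumptions for illustration) and prints only the resulting column names.

```python
import pandas as pd

# Hypothetical data: two text columns and one numeric column
df = pd.DataFrame({
    'Color': ['Red', 'Blue'],
    'Code': ['A1', 'B2'],   # identifier-like text that should stay as-is
    'Price': [10, 20],
})

# Without `columns`, every object/string/category column is encoded
all_encoded = pd.get_dummies(df)
print(all_encoded.columns.tolist())
# ['Price', 'Color_Blue', 'Color_Red', 'Code_A1', 'Code_B2']

# With `columns`, only 'Color' is encoded and 'Code' is left untouched
some_encoded = pd.get_dummies(df, columns=['Color'])
print(some_encoded.columns.tolist())
# ['Code', 'Price', 'Color_Blue', 'Color_Red']
```

Note that the untouched columns come first in the result, followed by the new dummy columns.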

Common Pitfalls

Common mistakes when using get_dummies include:

  • Omitting the columns parameter, which encodes every object, string, and category column, including ones you meant to leave alone (such as IDs).
  • Forgetting to set drop_first=True when fitting linear models, which leaves a redundant dummy column and causes multicollinearity.
  • Applying get_dummies separately on train and test sets, causing mismatched columns.

Always ensure consistent columns between datasets and consider dropping the first dummy if needed.

python
import pandas as pd

# Wrong: applying get_dummies separately on train and test
train = pd.DataFrame({'Color': ['Red', 'Blue']})
test = pd.DataFrame({'Color': ['Green', 'Blue']})

train_encoded = pd.get_dummies(train, columns=['Color'])
test_encoded = pd.get_dummies(test, columns=['Color'])

print('Train columns:', train_encoded.columns.tolist())
print('Test columns:', test_encoded.columns.tolist())

# Right: reindex both sets to one fixed list of expected columns
all_colors = ['Color_Blue', 'Color_Green', 'Color_Red']
train_encoded = train_encoded.reindex(columns=all_colors, fill_value=0)
test_encoded = test_encoded.reindex(columns=all_colors, fill_value=0)

print('Aligned train columns:', train_encoded.columns.tolist())
print('Aligned test columns:', test_encoded.columns.tolist())
Output
Train columns: ['Color_Blue', 'Color_Red']
Test columns: ['Color_Blue', 'Color_Green']
Aligned train columns: ['Color_Blue', 'Color_Green', 'Color_Red']
Aligned test columns: ['Color_Blue', 'Color_Green', 'Color_Red']
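
When the full list of categories is not known up front, one alternative is to take the column list from the encoded training set instead. This is a minimal sketch of that idea, not the only way to do it; dtype=int is an assumption added so the printed values are 0/1 on any pandas version.

```python
import pandas as pd

train = pd.DataFrame({'Color': ['Red', 'Blue']})
test = pd.DataFrame({'Color': ['Green', 'Blue']})

train_encoded = pd.get_dummies(train, columns=['Color'], dtype=int)

# Reuse the training columns: 'Color_Green' (unseen in training) is dropped,
# and 'Color_Red' (absent from test) is added and filled with 0
test_encoded = (
    pd.get_dummies(test, columns=['Color'], dtype=int)
    .reindex(columns=train_encoded.columns, fill_value=0)
)

print(test_encoded)
#    Color_Blue  Color_Red
# 0           0          0
# 1           1          0
```

The 'Green' row ends up as all zeros, so a model trained on these columns sees it as "none of the known colors". Another option is to convert the column to a pandas categorical dtype with a fixed category list before encoding, since get_dummies then emits a column for every listed category.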

Quick Reference

Parameter | Description | Default
data | Input DataFrame or Series to encode | Required
prefix | String, list, or dict to prepend to new column names | None
drop_first | Drop first category to avoid multicollinearity | False
columns | List of columns to encode; if None, all object, string, and category columns | None
dummy_na | Add column for NaN values | False
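
The dummy_na parameter from the table is easiest to see on data with a missing value. The sketch below uses a made-up 'Color' column containing one NaN; dtype=int is again an assumption to get 0/1 output.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
df = pd.DataFrame({'Color': ['Red', np.nan, 'Blue']})

# Default: the row with NaN gets 0 in every dummy column
default = pd.get_dummies(df, columns=['Color'], dtype=int)
print(default.columns.tolist())
# ['Color_Blue', 'Color_Red']

# dummy_na=True adds an explicit indicator column for missing values
with_na = pd.get_dummies(df, columns=['Color'], dummy_na=True, dtype=int)
print(with_na.columns.tolist())
# ['Color_Blue', 'Color_Red', 'Color_nan']
```

Keeping the extra column can be useful when missingness itself carries signal; otherwise the default is usually enough.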

Key Takeaways

  • Use pandas.get_dummies() to convert categorical columns into one-hot encoded indicator columns.
  • Specify the columns parameter to control which columns get encoded.
  • Use drop_first=True to avoid a redundant column in linear models.
  • Ensure train and test datasets have matching dummy columns so the model sees the same features at training and prediction time.
  • get_dummies creates one new column per category, marking category presence with True/False (pandas 2.0 and later) or 1/0 (with dtype=int).