How to Use Category dtype in pandas for Efficient Data

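Use category dtype in pandas by converting a column with astype('category'). pandas then stores each distinct value once and represents the column as small integer codes, which reduces memory for columns with many repeated values and can speed up operations on them.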

Syntax

The basic syntax to convert a pandas column to category dtype is:

  • df['column'] = df['column'].astype('category'): Converts the column to category dtype.
  • pd.Categorical(data, categories=..., ordered=...): Creates a categorical object with optional categories and order.
python
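import pandas as pd
df = pd.DataFrame({'column': ['a', 'b', 'a']})  # sample data for illustration

# Convert an existing column to category dtype
df['column'] = df['column'].astype('category')

# Or create a categorical directly, with explicit categories and order
cat = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c'], ordered=True)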
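
To make the effect of the categories and ordered arguments more concrete, here is a minimal sketch that inspects the categorical built above. The variable names are illustrative only; codes and categories are the attributes pandas exposes on a Categorical.

```python
import pandas as pd

# Ordered categorical: 'a' < 'b' < 'c'; 'c' is declared but never observed
cat = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c'], ordered=True)

codes = cat.codes.tolist()            # integer position of each value in categories
categories = cat.categories.tolist()  # the declared categories, in order
below_b = (cat < 'b').tolist()        # < and > work because ordered=True

print(codes, categories, below_b)
```

Each value is stored as a small integer code pointing into the category list, which is why repeated values cost so little.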

Example

This example shows how to convert a column to category dtype, check memory usage before and after, and see the categories.

python
import pandas as pd

# Create a DataFrame with repeated string values
data = {'fruit': ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']}
df = pd.DataFrame(data)

# Memory usage before conversion
mem_before = df.memory_usage(deep=True).sum()

# Convert 'fruit' column to category dtype
df['fruit'] = df['fruit'].astype('category')

# Memory usage after conversion
mem_after = df.memory_usage(deep=True).sum()

# Print DataFrame info (df.info() prints its summary and returns None)
df.info()
categories = df['fruit'].cat.categories

(mem_before, mem_after, categories.tolist())
Output (byte counts are illustrative and vary with pandas version and platform)
(240, 144, ['apple', 'banana', 'orange'])
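
With only six rows the savings are small, because the fixed overhead dominates. As a rough sketch with made-up data (three distinct strings repeated to 30,000 rows), the difference becomes easier to see. Exact byte counts depend on your pandas version, so none are shown here.

```python
import pandas as pd

# Hypothetical data: 3 distinct strings repeated to 30,000 rows
fruits = pd.Series(['apple', 'banana', 'orange'] * 10_000)

mem_strings = fruits.memory_usage(deep=True)
mem_category = fruits.astype('category').memory_usage(deep=True)

print(f"strings: {mem_strings} bytes, category: {mem_category} bytes")
```

The category version stores one small integer code per row plus a single copy of each distinct string.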

Common Pitfalls

Common mistakes when using category dtype include:

  • Converting columns with many unique values, which can increase memory overhead instead of reducing it.
  • Forgetting to specify ordered=True when order matters (like sizes or ratings).
  • Assigning a value that is not among the existing categories without adding it to the category list first.

Always check if category dtype fits your data pattern.
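
One way to check whether a column fits is to look at how many of its values are distinct. The sketch below uses a hypothetical helper, unique_ratio, which is not part of pandas. A ratio near 0 suggests a good candidate for category dtype, and a ratio near 1 suggests the conversion is unlikely to help.

```python
import pandas as pd

def unique_ratio(s: pd.Series) -> float:
    """Fraction of non-null values that are distinct (hypothetical helper)."""
    return s.nunique() / max(s.count(), 1)

repeated = pd.Series(['red', 'green', 'blue'] * 1000)        # 3 distinct values in 3000 rows
mostly_unique = pd.Series([f'id_{i}' for i in range(3000)])  # every value is distinct

low = unique_ratio(repeated)
high = unique_ratio(mostly_unique)
print(low, high)
```

Any cutoff you pick is a judgment call for your data. Measuring memory before and after conversion remains the most direct check.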

python
import pandas as pd

# Wrong: assigning a value that is not one of the existing categories
s = pd.Series(['small', 'medium', 'large', 'large']).astype('category')
try:
    s.loc[3] = 'extra large'  # raises an error (TypeError in recent pandas)
except (TypeError, ValueError) as e:
    error = str(e)

# Right: add the new category before assignment
s = pd.Series(['small', 'medium', 'large', 'large']).astype('category')
s = s.cat.add_categories(['extra large'])
s.loc[3] = 'extra large'

(error, s.tolist())
Output (the exact error wording varies by pandas version)
('Cannot setitem on a Categorical with a new category, set the categories first', ['small', 'medium', 'large', 'extra large'])
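
The ordered=True pitfall listed above can be illustrated in the same way. This is a minimal sketch with made-up size labels. It uses pd.CategoricalDtype to declare the order when converting an existing Series.

```python
import pandas as pd

sizes = pd.Series(['medium', 'small', 'large', 'small'])

# Declare the logical order explicitly; plain astype('category') would give
# unordered categories sorted alphabetically
size_type = pd.CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
sizes = sizes.astype(size_type)

sorted_sizes = sizes.sort_values().tolist()     # sorted by category order
largest = sizes.max()                           # max is defined for ordered categoricals
at_least_medium = (sizes >= 'medium').tolist()  # element-wise comparison

print(sorted_sizes, largest, at_least_medium)
```

Without ordered=True, comparisons such as >= raise an error. Sorting would follow the alphabetical category order (large, medium, small).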

Quick Reference

| Operation | Syntax | Description |
|---|---|---|
| Convert column to category | df['col'] = df['col'].astype('category') | Change dtype to category for memory and speed benefits |
| Create categorical with order | pd.Categorical(data, categories=[...], ordered=True) | Create ordered categorical data |
| Add new category | s = s.cat.add_categories(['new_cat']) | Add new category before assigning new values |
| Remove unused categories | s = s.cat.remove_unused_categories() | Clean categories not present in data |
| Get categories | s.cat.categories | View all categories in the series |
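
As a brief sketch of the remove_unused_categories entry, using a small made-up Series: filtering rows leaves the category list unchanged until you clean it up explicitly.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c']).astype('category')

# Filtering rows does not shrink the category list
filtered = s[s != 'c']
before = filtered.cat.categories.tolist()

# Drop categories that no longer appear in the data
cleaned = filtered.cat.remove_unused_categories()
after = cleaned.cat.categories.tolist()

print(before, after)
```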

Key Takeaways

  • Convert string columns with repeated values to category dtype using astype('category') to save memory.
  • Use ordered=True in pd.Categorical if the categories have a meaningful order.
  • Add new categories explicitly before assigning new values to avoid errors.
  • Category dtype can speed up comparisons and group operations, because pandas works on integer codes rather than the original values.
  • Check memory usage before and after conversion to confirm the benefit for your data.
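
For the group operations mentioned above, a small sketch might look like the following. The qty column is invented for illustration. Passing observed=True explicitly restricts the result to categories that actually appear in the data and avoids depending on the default, which has differed between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['apple', 'banana', 'apple', 'orange', 'banana', 'apple'],
    'qty': [1, 2, 3, 4, 5, 6],  # hypothetical quantities
})
df['fruit'] = df['fruit'].astype('category')

# Group on the categorical column and sum the quantities per fruit
totals = df.groupby('fruit', observed=True)['qty'].sum()
result = totals.to_dict()

print(result)
```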