How to Use Category dtype in pandas for Efficient Data
Use category dtype in pandas by converting a column with astype('category'). This makes data use less memory and speeds up operations on repeated values.
Syntax
The basic syntax to convert a pandas column to category dtype is:
- df['column'] = df['column'].astype('category'): Converts the column to category dtype.
- pd.Categorical(data, categories=..., ordered=...): Creates a categorical object with optional categories and order.
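```python
import pandas as pd

df = pd.DataFrame({'column': ['a', 'b', 'a']})

# Convert an existing column to category dtype
df['column'] = df['column'].astype('category')

# Or create a categorical directly
cat = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c'], ordered=True)
```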
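To make the syntax more concrete, the following sketch inspects the categorical built above. It only uses the standard Categorical attributes codes and categories; the variable names are illustrative.

```python
import pandas as pd

# Same categorical as in the syntax example above
cat = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c'], ordered=True)

# Each value is stored as a small integer code pointing into the category list
codes = cat.codes.tolist()
# The category list includes 'c' even though it never occurs in the data
categories = cat.categories.tolist()

# With ordered=True, comparisons follow the declared category order
is_before_b = (cat < 'b').tolist()

print(codes, categories, is_before_b)
```

Storing integer codes instead of repeated strings is where the memory savings come from.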
Example
This example shows how to convert a column to category dtype, check memory usage before and after, and see the categories.
```python
import pandas as pd

# Create a DataFrame with repeated string values
data = {'fruit': ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']}
df = pd.DataFrame(data)

# Memory usage before conversion
mem_before = df.memory_usage(deep=True).sum()

# Convert 'fruit' column to category dtype
df['fruit'] = df['fruit'].astype('category')

# Memory usage after conversion
mem_after = df.memory_usage(deep=True).sum()

# Collect the categories and print the results
# (df.info() is omitted: it prints a summary and returns None)
categories = df['fruit'].cat.categories
print((int(mem_before), int(mem_after), categories.tolist()))
```
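Output
(240, 144, ['apple', 'banana', 'orange'])
The byte counts are illustrative: exact values depend on the pandas version, how strings are stored, and the platform. What matters is that the second number is smaller than the first.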
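On a six-row DataFrame the savings are small. The sketch below repeats the comparison on a larger, hypothetical column of 30,000 rows with only three distinct strings, where the effect is much easier to see. The data and variable names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data: 30,000 rows drawn from only three distinct strings
fruits = pd.Series(['apple', 'banana', 'orange'] * 10_000)

as_category = fruits.astype('category')

# deep=True includes the memory taken by the string values themselves
mem_plain = int(fruits.memory_usage(deep=True))
mem_category = int(as_category.memory_usage(deep=True))

print(mem_plain, mem_category, mem_category < mem_plain)
```

The exact numbers vary by environment, so the sketch only relies on the categorical column being the smaller of the two.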
Common Pitfalls
Common mistakes when using category dtype include:
- Converting columns whose values are mostly unique, which can increase memory use instead of reducing it.
- Forgetting to specify ordered=True when order matters (like sizes or ratings).
- Trying to assign a value that is not yet in the category list without adding the category first.
Always check if category dtype fits your data pattern.
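One way to run that check is to compare the number of distinct values with the number of rows before converting. The helper below is a hypothetical sketch, not a pandas API, and the 0.5 threshold is an arbitrary assumption to tune for your own data.

```python
import pandas as pd

def maybe_to_category(s: pd.Series, max_unique_ratio: float = 0.5) -> pd.Series:
    """Convert s to category only if few of its values are distinct.

    Hypothetical helper; the default threshold is an assumption.
    """
    if len(s) == 0:
        return s
    unique_ratio = s.nunique(dropna=True) / len(s)
    return s.astype('category') if unique_ratio <= max_unique_ratio else s

repeated = pd.Series(['red', 'green', 'red', 'red', 'green', 'red'])
mostly_unique = pd.Series(['id-1', 'id-2', 'id-3', 'id-4', 'id-5', 'id-6'])

print(maybe_to_category(repeated).dtype)       # converted to category
print(maybe_to_category(mostly_unique).dtype)  # left unchanged
```

A ratio near 1 means almost every row would need its own category, so the conversion adds bookkeeping without saving anything.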
The third pitfall looks like this in practice:
```python
import pandas as pd

# Wrong: assigning a value that is not among the categories
s = pd.Series(['small', 'medium', 'large']).astype('category')
try:
    s.loc[2] = 'extra large'  # raises: 'extra large' is not a category
except (TypeError, ValueError) as e:
    error = str(e)

# Right: add the new category before assignment
s = pd.Series(['small', 'medium', 'large']).astype('category')
s = s.cat.add_categories(['extra large'])
s.loc[2] = 'extra large'

print((error, s.tolist()))
```
Output
('Cannot setitem on a Categorical with a new category (extra large), set the categories first', ['small', 'medium', 'extra large'])
The exact wording of the error message varies between pandas versions; recent releases include the rejected value in parentheses.
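The ordered=True pitfall can be illustrated the same way. In this sketch, which uses made-up size labels, the logical order is declared through pd.CategoricalDtype so that sorting and max() follow it rather than alphabetical order.

```python
import pandas as pd

sizes = pd.Series(['medium', 'small', 'large', 'small'])

# Declare the logical order explicitly; alphabetical order would be wrong here
size_type = pd.CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
ordered_sizes = sizes.astype(size_type)

# Sorting and max() now respect small < medium < large
sorted_values = ordered_sizes.sort_values().tolist()
largest = ordered_sizes.max()

print(sorted_values, largest)
```

Without ordered=True, max() is not defined for the column, and sorting falls back to whatever order the categories happen to be stored in.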
Quick Reference
| Operation | Syntax | Description |
|---|---|---|
| Convert column to category | df['col'] = df['col'].astype('category') | Change dtype to category for memory and speed benefits |
| Create categorical with order | pd.Categorical(data, categories=[...], ordered=True) | Create ordered categorical data |
| Add new category | s = s.cat.add_categories(['new_cat']) | Add new category before assigning new values |
| Remove unused categories | s = s.cat.remove_unused_categories() | Clean categories not present in data |
| Get categories | s.cat.categories | View all categories in the series |
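As a brief illustration of the remove_unused_categories() row, this sketch filters a small, made-up Series and shows that the category list only shrinks after the explicit cleanup call.

```python
import pandas as pd

sizes = pd.Series(['small', 'large', 'medium', 'small']).astype('category')

# Filtering rows keeps every original category, even ones no longer present
filtered = sizes[sizes != 'large']
before = filtered.cat.categories.tolist()

# Drop the categories that no longer occur in the data
cleaned = filtered.cat.remove_unused_categories()
after = cleaned.cat.categories.tolist()

print(before, after)
```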
Key Takeaways
Convert string columns with repeated values to category dtype using astype('category') to save memory.
Use ordered=True in pd.Categorical if the categories have a meaningful order.
Add new categories explicitly before assigning new values to avoid errors.
Category dtype can speed up comparisons and group operations on columns with many repeated values.
Check memory usage before and after to see benefits of category dtype.