
Categorical data type optimization in Data Analysis Python

Introduction

Converting a column of repeated text values to the pandas categorical data type reduces memory use and can speed up analysis.

When to use it:
You have a column with repeated text labels like colors or categories.
You want to reduce the memory used by your dataset.
You want faster grouping or filtering on text columns.
You need to prepare data for machine learning models that work better with categories.
You want to improve performance when working with large datasets.
Syntax
Data Analysis Python
df['column_name'] = df['column_name'].astype('category')
Use astype('category') to convert a column to the categorical type.
A categorical column stores each unique value once and represents every row as a small integer code that points to it.
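To see what "storing each unique value once" looks like in practice, the short sketch below inspects the two parts of a categorical column through the .cat accessor. The sample values are made up for illustration.

```python
import pandas as pd

# Illustrative data, not taken from a real dataset
s = pd.Series(['red', 'blue', 'red', 'green', 'blue'])
cat = s.astype('category')

# The unique labels, stored once (sorted when inferred from the data)
print(cat.cat.categories.tolist())  # ['blue', 'green', 'red']

# One small integer per row, pointing into the categories
print(cat.cat.codes.tolist())       # [2, 0, 2, 1, 0]
print(cat.cat.codes.dtype)          # int8
```

With only a few categories, each row costs one byte for its code instead of a full text value.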
Examples
This converts the 'color' column to categorical type.
Data Analysis Python
df['color'] = df['color'].astype('category')
Convert the 'city' column and print its categories.
Data Analysis Python
df['city'] = df['city'].astype('category')
print(df['city'].cat.categories)
Create an ordered categorical column with a specific category order.
Data Analysis Python
df['size'] = pd.Categorical(df['size'], categories=['small', 'medium', 'large'], ordered=True)
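One thing to watch for when listing categories explicitly: any value that is not in the list becomes a missing value. The sketch below shows this with an invented 'extra-large' label that is deliberately left out of the category list.

```python
import pandas as pd

# 'extra-large' is an illustrative value that is not in the category list
raw = pd.Series(['small', 'large', 'extra-large'])
sizes = pd.Categorical(raw, categories=['small', 'medium', 'large'], ordered=True)

# A code of -1 marks a missing value
print(sizes.codes.tolist())     # [0, 2, -1]
print(pd.isna(sizes).tolist())  # [False, False, True]
```

Checking for unexpected missing values after the conversion is a simple way to catch labels that were misspelled or forgotten.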
Sample Program

This code shows how converting a text column to categorical reduces memory and lists the unique categories.

Data Analysis Python
import pandas as pd

# Create a sample DataFrame with repeated text values
data = {'color': ['red', 'blue', 'red', 'green', 'blue', 'blue', 'red']}
df = pd.DataFrame(data)

# Check memory usage before conversion
mem_before = df.memory_usage(deep=True).sum()

# Convert 'color' column to categorical
df['color'] = df['color'].astype('category')

# Check memory usage after conversion
mem_after = df.memory_usage(deep=True).sum()

# Show categories and memory usage difference
print('Categories:', df['color'].cat.categories)
print(f'Memory before: {mem_before} bytes')
print(f'Memory after: {mem_after} bytes')
Important Notes

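Categorical data is best for columns with few unique values relative to the number of rows. For a column where almost every value is unique, converting can use more memory than leaving it as text.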
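One possible way to apply this rule is to convert only the text columns whose share of unique values is small. The helper below, convert_low_cardinality, and its max_unique_ratio threshold of 0.5 are hypothetical choices for this sketch, not part of pandas.

```python
import pandas as pd

def convert_low_cardinality(df, max_unique_ratio=0.5):
    """Hypothetical helper: convert text columns with few unique values."""
    out = df.copy()
    if len(out) == 0:
        return out
    for col in out.columns:
        is_text = out[col].dtype == object or pd.api.types.is_string_dtype(out[col])
        # Share of unique values; the 0.5 default is an assumed cutoff
        if is_text and out[col].nunique() / len(out) <= max_unique_ratio:
            out[col] = out[col].astype('category')
    return out

# Illustrative data: one repetitive column, one with all-unique values
df = pd.DataFrame({
    'color': ['red', 'blue'] * 50,
    'user_id': [f'u{i}' for i in range(100)],
})
converted = convert_low_cardinality(df)
print(converted.dtypes)
```

Here 'color' is converted because it has 2 unique values in 100 rows, while 'user_id' is left alone because every value is different.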

Ordered categories allow comparisons like less than or greater than.
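As a small sketch of those comparisons, the example below builds an ordered size column from made-up values and then compares, sorts, and takes the minimum and maximum.

```python
import pandas as pd

# Illustrative values with an explicit small < medium < large order
sizes = pd.Series(
    pd.Categorical(
        ['medium', 'small', 'large', 'small'],
        categories=['small', 'medium', 'large'],
        ordered=True,
    )
)

# Comparison follows the declared order, not alphabetical order
bigger_than_small = sizes > 'small'
print(bigger_than_small.tolist())    # [True, False, True, False]

# Sorting and min/max respect the same order
print(sizes.sort_values().tolist())  # ['small', 'small', 'medium', 'large']
print(sizes.min(), sizes.max())      # small large
```

Without ordered=True, the greater-than comparison raises a TypeError because the categories have no defined order.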

Converting back to string is easy with astype(str).
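A minimal round trip, using made-up values, might look like this:

```python
import pandas as pd

s = pd.Series(['red', 'blue', 'red']).astype('category')
back = s.astype(str)

print(s.dtype)                                      # category
print(back.tolist())                                # ['red', 'blue', 'red']
print(isinstance(back.dtype, pd.CategoricalDtype))  # False
```

The values are unchanged; only the way the column is stored goes back to plain text.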

Summary

Categorical type saves memory by storing unique values once.

Use astype('category') to convert columns.

It can speed up filtering and grouping operations, especially on large datasets.
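Filtering and grouping are written the same way on a categorical column as on a text column. The sketch below uses invented city and sales values, and passes observed=True explicitly so that only categories present in the data appear in the result.

```python
import pandas as pd

# Illustrative data
df = pd.DataFrame({
    'city': ['Paris', 'Rome', 'Paris', 'Rome', 'Paris'],
    'sales': [10, 20, 30, 40, 50],
})
df['city'] = df['city'].astype('category')

# Filtering works the same as on a text column
paris = df[df['city'] == 'Paris']
print(paris['sales'].sum())  # 90

# observed=True keeps only categories that actually appear in the data
totals = df.groupby('city', observed=True)['sales'].sum()
print(totals.to_dict())      # {'Paris': 90, 'Rome': 60}
```

Any speed gain depends on the size of the data and the number of categories, so it is worth timing both versions on your own dataset.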