Overview - Converting to categorical

What is it?

Converting to categorical means changing data columns to a special type that stores repeated values efficiently. Instead of storing the same text many times, it stores a list of unique categories and uses numbers to represent them. This saves memory and speeds up some operations. It is useful when data has a limited set of possible values, like colors or countries.

Why it matters

Without categorical conversion, data with repeated values wastes memory and slows down analysis. Large datasets become harder to handle and slower to process. Using categorical types makes data smaller and faster to work with, which is important for real-world tasks like analyzing customer segments or survey answers. It also helps algorithms understand that some data is not continuous but belongs to groups.

Where it fits

Before learning this, you should know how to use pandas DataFrames and basic data types like strings and numbers. After this, you can learn about encoding techniques for machine learning, like one-hot encoding, and how to optimize data storage and performance in pandas.

Mental Model

Core Idea

Converting to categorical replaces repeated values with small codes pointing to unique categories, saving space and improving speed.

Think of it like...

It's like having a dictionary where each word is assigned a number, and instead of writing the word many times, you just write its number. This makes the text shorter and easier to handle.

Data column: [red, blue, red, green, blue]
Unique categories: {red:0, blue:1, green:2}
Stored as codes: [0, 1, 0, 2, 1]
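The mapping above can be reproduced with a minimal sketch. It assumes `pd.factorize` as a stand-in for the idea (it numbers values in order of first appearance, like the illustration); the variable names are made up for the example.

```python
import pandas as pd

colors = pd.Series(["red", "blue", "red", "green", "blue"])

# factorize assigns one integer per distinct value, in order of first appearance
codes, uniques = pd.factorize(colors)

print(list(uniques))   # the distinct values, stored once
print(codes.tolist())  # one small integer per row
```

Note that `astype('category')` sorts inferred categories when they are sortable, so the codes it produces may differ from the order-of-appearance numbering shown here.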

Build-Up - 7 Steps

1

Foundation: Understanding basic data types in pandas

Concept: Learn what data types pandas uses to store data and why they matter.

Pandas stores each column with a single data type, such as int, float, or object (the traditional type for strings). Each type uses memory differently. For example, strings stored as 'object' take more space because every value is a separate Python object that the column only points to.

Result

You see that columns with repeated strings use more memory than numeric columns.

Knowing data types helps you understand why converting to categorical can save memory.
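One way to check this yourself is sketched below. The column names and sizes are invented for illustration, and the string column is forced to `object` dtype so the comparison matches the description above regardless of pandas version.

```python
import pandas as pd

n = 10_000
df = pd.DataFrame({
    # repeated strings, explicitly stored as Python objects
    "city": pd.Series(["London", "Paris", "Tokyo", "Lima"] * (n // 4), dtype=object),
    # plain 64-bit integers
    "visits": pd.Series(range(n), dtype="int64"),
})

print(df.dtypes)

# deep=True also counts the memory held by the string objects themselves
mem = df.memory_usage(deep=True, index=False)
print(mem)
```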

2

Foundation: What is a categorical data type?

3

Intermediate: How to convert columns to categorical

4

Intermediate: Benefits of categorical conversion

5

Intermediate: Handling categories: ordered vs unordered

6

Advanced: Customizing categories and adding new ones

7

Expert: Memory and performance trade-offs in large datasets

Under the Hood

Pandas stores categorical data as two arrays: one holding the unique categories and one holding integer codes that point into it. When you access the data, pandas translates the codes back to categories. This reduces memory when values repeat, because each row stores a small integer instead of a full string. Operations such as equality comparisons can work on the codes, which is generally faster than comparing strings.
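The two arrays can be inspected through the `.cat` accessor. The following is a small sketch with made-up data; the manual lookup with `take` is only there to illustrate the translation step, not how pandas implements it internally.

```python
import pandas as pd

colors = pd.Series(["red", "blue", "red", "green", "blue"] * 2_000, dtype=object)
as_cat = colors.astype("category")

print(as_cat.cat.categories)    # unique values, stored once
print(as_cat.cat.codes.head())  # one integer code per row

# Translate codes back to values by looking each one up in the categories
rebuilt = as_cat.cat.categories.take(as_cat.cat.codes.to_numpy())

# Compare memory of the two representations (index excluded)
obj_bytes = colors.memory_usage(deep=True, index=False)
cat_bytes = as_cat.memory_usage(deep=True, index=False)
print(obj_bytes, cat_bytes)
```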

Why designed this way?

Categorical types were designed to handle repeated values efficiently, inspired by database normalization and factor types in R. This design balances memory savings and speed without changing data semantics. Alternatives like storing strings directly waste memory and slow down operations.

┌───────────────────────────────┐
│ Original column:              │
│ [red, blue, red, green, blue] │
└──────────────┬────────────────┘
               │ convert to categorical
               ▼
┌───────────────────────────────┐
│ Categories: [red, blue, green]│
│ Codes:      [0, 1, 0, 2, 1]   │
└───────────────────────────────┘

Myth Busters - 4 Common Misconceptions

Quick: Does converting to categorical change the visible data values? Commit yes or no.

Common Belief: Converting to categorical changes the actual data values in the column.

Reality: No. The values you see stay the same; only the way they are stored changes.

Quick: Does categorical conversion always speed up all operations? Commit yes or no.

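Common Belief: Categorical conversion always makes data operations faster.

Reality: No. The gains depend on the data. For columns with mostly unique values, conversion adds overhead without benefit.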

Quick: Can you add any new value to a categorical column without errors? Commit yes or no.

Common Belief: You can add any new value to a categorical column after conversion without extra steps.

Reality: No. The set of categories is fixed, so a new value must first be registered as a category.

Quick: Are ordered categories the default when converting? Commit yes or no.

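Common Belief: Categories are ordered by default after conversion.

Reality: No. Categories are unordered by default; an order has to be requested explicitly, for example with ordered=True.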

Expert Zone

1

Categorical codes are stored as the smallest possible integer type (int8, int16, etc.) to save memory further.

2

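When saving DataFrames with categorical columns to disk, whether categories and codes survive depends on the file format. Binary formats such as pickle (and, depending on the engine, Parquet or Feather) can round-trip the categorical type, while CSV writes plain values, so the column must be converted again after loading.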

3

Categorical types integrate with pandas groupby and pivot operations to speed up aggregation by working on codes.
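Two of the points above can be checked with a short sketch. The data and column names are hypothetical, and `observed` is passed explicitly because its default has differed between pandas versions.

```python
import pandas as pd

# Codes use the smallest integer type that fits the number of categories
few = pd.Series(["a", "b", "c", "a"], dtype="category")
many = pd.Series(range(200)).astype("category")
print(few.cat.codes.dtype, many.cat.codes.dtype)

# groupby on a categorical key; "west" is a declared but unused category
sales = pd.DataFrame({
    "region": pd.Categorical(
        ["north", "south", "north"], categories=["north", "south", "west"]
    ),
    "amount": [10, 20, 5],
})
seen_only = sales.groupby("region", observed=True)["amount"].sum()
all_groups = sales.groupby("region", observed=False)["amount"].sum()
print(seen_only)
print(all_groups)
```

With `observed=False` the result also contains a row for every declared category that never occurs in the data, which can matter for both output size and speed.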

When NOT to use

Avoid categorical conversion for columns with mostly unique values (like IDs or timestamps) because it adds overhead without benefits. Use numeric or string types instead. For machine learning, sometimes one-hot encoding or embedding is better than categorical codes.
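A quick measurement makes the overhead visible. This is a sketch on invented ID strings; the exact byte counts will vary by platform and pandas version, so none are shown.

```python
import pandas as pd

# Every value is distinct, as with IDs
ids = pd.Series([f"user-{i:06d}" for i in range(5_000)], dtype=object)
ids_cat = ids.astype("category")

unique_ratio = ids.nunique() / len(ids)
obj_bytes = ids.memory_usage(deep=True, index=False)
cat_bytes = ids_cat.memory_usage(deep=True, index=False)

# The categorical still stores every string once, plus a code per row
print(unique_ratio, obj_bytes, cat_bytes)
```

Checking the ratio of unique values to rows before converting is one simple heuristic; where to draw the line depends on the dataset.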

Production Patterns

In production, categorical conversion is used to optimize memory in large datasets like customer demographics or survey responses. It is combined with explicit category ordering for ordinal data and careful category management to handle new data gracefully.
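One possible shape for this pattern is sketched below: a single shared `CategoricalDtype` applied to every incoming batch, with a check for values outside the known categories. `SIZE_DTYPE` and `to_size` are hypothetical names for the example, not part of pandas.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# One shared, explicitly ordered definition reused for every batch
SIZE_DTYPE = CategoricalDtype(categories=["small", "medium", "large"], ordered=True)

def to_size(values: pd.Series):
    """Convert to the shared dtype and report values it does not know."""
    converted = values.astype(SIZE_DTYPE)
    # Values outside the categories become NaN during conversion
    unseen = values[converted.isna() & values.notna()].unique().tolist()
    return converted, unseen

batch = pd.Series(["small", "xl", "large"], dtype=object)
sizes, unseen = to_size(batch)
print(sizes)
print("unseen values:", unseen)
```

Whether unseen values should be logged, rejected, or added as new categories is a design decision for the pipeline; the sketch only surfaces them.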

Connections

One-hot encoding

Builds-on

Understanding categorical codes helps grasp one-hot encoding, which transforms categories into binary vectors for machine learning.

Database normalization

Same pattern

Categorical conversion mirrors database normalization by storing repeated values once and referencing them, reducing redundancy.

Data compression algorithms

Similar principle

Both categorical conversion and compression replace repeated data with shorter codes to save space, showing a shared efficiency goal.
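As an illustration of the one-hot connection listed above, here is a small sketch using `pd.get_dummies` on a categorical Series. The data is made up, and `dtype=int` is passed so the output is 0/1 rather than booleans.

```python
import pandas as pd

colors = pd.Series(["red", "blue", "red", "green", "blue"], dtype="category")

# One column per category; each row has a single 1 marking its category
one_hot = pd.get_dummies(colors, dtype=int)
print(one_hot)
```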

Common Pitfalls

#1 Trying to add a new value to a categorical column without updating categories.

Wrong approach: df.loc[5, 'col'] = 'new_value' # after conversion without adding category

Correct approach:
df['col'] = df['col'].cat.add_categories(['new_value'])
df.loc[5, 'col'] = 'new_value'

Root cause: Misunderstanding that categories are fixed sets and new values must be registered first.
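The failure and the fix can be reproduced with a small sketch on a toy DataFrame. The exception type has varied across pandas versions, so both `TypeError` and `ValueError` are caught here as an assumption.

```python
import pandas as pd

df = pd.DataFrame({"col": pd.Series(["a", "b", "a"], dtype="category")})

# Assigning a value that is not a known category is rejected
try:
    df.loc[1, "col"] = "new_value"
    raised = False
except (TypeError, ValueError) as exc:
    raised = True
    print("rejected:", exc)

# Register the category first, then assign
df["col"] = df["col"].cat.add_categories(["new_value"])
df.loc[1, "col"] = "new_value"
print(df["col"].cat.categories.tolist())
```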

#2 Assuming categorical conversion always reduces memory.

Wrong approach: df['col'] = df['col'].astype('category') # on a column with mostly unique values

Correct approach: Keep the column as string/object when most values are unique, or compare memory_usage(deep=True) before and after conversion.

Root cause: Not considering data uniqueness and size before converting.

#3 Using unordered categories when order matters.

Wrong approach: df['size'] = df['size'].astype('category') # sizes like small, medium, large unordered

Correct approach: df['size'] = pd.Categorical(df['size'], categories=['small', 'medium', 'large'], ordered=True)

Root cause: Ignoring the importance of order in ordinal data.
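The practical difference shows up in sorting and comparisons. The sketch below uses invented size data and contrasts the default conversion with an explicitly ordered one.

```python
import pandas as pd

raw = pd.Series(["medium", "small", "large", "small"], dtype=object)

unordered = raw.astype("category")
ordered = pd.Series(
    pd.Categorical(raw, categories=["small", "medium", "large"], ordered=True)
)

# Inferred categories sort alphabetically; declared ones follow the given order
print(unordered.sort_values().tolist())
print(ordered.sort_values().tolist())

# Order-aware operations only work on ordered categoricals
print((ordered < "large").tolist())
print(ordered.max())

try:
    unordered < "large"
    comparison_failed = False
except TypeError:
    comparison_failed = True
```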

Key Takeaways

Converting to categorical changes how data is stored, not the data itself, by replacing repeated values with codes.

Categorical types save memory and speed up operations when data has many repeated values and few unique categories.

You must manage categories carefully, especially when adding new values or working with ordered data.

Categorical conversion is not always beneficial; its effectiveness depends on data size and uniqueness.

Understanding categorical data connects to broader concepts like database normalization and data compression.