Overview - Adding and removing categories

What is it?

Adding and removing categories is about changing the list of possible values in a pandas categorical data type. Categories are like labels that a column can have, and you can add new labels or remove old ones. This helps keep data organized and efficient when working with repeated values. It is useful when your data changes or you want to clean it up.

Why it matters

Without the ability to add or remove categories, you might have categories that don't match your data or miss new categories that appear. This can cause errors or waste memory. Being able to update categories keeps your data accurate and your programs faster. It also helps when you want to analyze or visualize data with the right groups.

Where it fits

Before learning this, you should understand what categorical data is in pandas and how to create it. After this, you can learn about category ordering, category renaming, and how categories affect grouping and plotting.

Mental Model

Core Idea

Categories in pandas are like a fixed set of labels for data. Adding a category changes which labels are allowed without touching existing values; removing a category that is still in use turns the values that used it into missing values (NaN).

Think of it like...

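Imagine a box of colored pencils where each color is a category. Adding a category is like adding a new color pencil to the box, and removing a category is like taking a color pencil out. Adding a pencil doesn’t change the drawings you already made; it only changes the colors you can use in the future. Where the analogy breaks down: if you take out a pencil that was used, pandas blanks out those parts of the drawing (they become missing values).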

Categories:
┌───────────────┐
│ Red           │
│ Blue          │
│ Green         │
└───────────────┘

Data values:
[Red, Blue, Red, Green, Blue]

Add category 'Yellow':
┌───────────────┐
│ Red           │
│ Blue          │
│ Green         │
│ Yellow        │
└───────────────┘

Remove category 'Green':
┌───────────────┐
│ Red           │
│ Blue          │
│ Yellow        │
└───────────────┘

Data values after removing 'Green':
[Red, Blue, Red, NaN, Blue]

Build-Up - 7 Steps

1

Foundation: Understanding pandas categorical data

Concept: Learn what categorical data is and why pandas uses it.

Categorical data in pandas stores values that come from a fixed set of categories. When a small number of distinct values repeat many times, this saves memory and speeds up operations compared to storing each value as a string. You create a categorical column by converting a normal column using pd.Categorical or .astype('category').

Result

You get a column with categories and codes representing the data values.

Understanding categorical data is key because adding or removing categories only makes sense when you know what categories are and how they represent data.
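
As a minimal sketch of the step above, assuming a small illustrative Series of color names (the data and variable names are not from the original lesson), this converts a column and shows its categories and codes:

```python
import pandas as pd

# Illustrative data; the column contents are an assumption for this sketch.
colors = pd.Series(["red", "blue", "red", "green", "blue"])

# Convert an ordinary string column to the categorical dtype.
colors_cat = colors.astype("category")

# Categories are inferred from the unique values and sorted.
print(list(colors_cat.cat.categories))  # ['blue', 'green', 'red']

# Each value is stored as an integer code indexing into the categories.
print(colors_cat.cat.codes.tolist())    # [2, 0, 2, 1, 0]
```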

2

Foundation: Creating and viewing categories

3

Intermediate: Adding new categories safely
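
A minimal sketch of adding categories, assuming a small illustrative Series. It relies on add_categories raising ValueError when a label already exists, so the sketch filters the wanted labels first; the variable names are hypothetical.

```python
import pandas as pd

s = pd.Series(["red", "blue", "red"], dtype="category")

# add_categories returns a new Series; existing values are untouched.
s = s.cat.add_categories(["yellow"])
print(list(s.cat.categories))  # ['blue', 'red', 'yellow']
print(s.tolist())              # ['red', 'blue', 'red']

# Adding a label that already exists raises ValueError, so filter first.
wanted = ["yellow", "green"]
missing = [c for c in wanted if c not in s.cat.categories]
s = s.cat.add_categories(missing)
print(list(s.cat.categories))  # ['blue', 'red', 'yellow', 'green']
```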

4

Intermediate: Removing categories carefully
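
A minimal sketch of the two removal paths, assuming an illustrative Series with one unused label. remove_unused_categories only drops labels that no value uses, while remove_categories on a label in use leaves NaN behind.

```python
import pandas as pd

s = pd.Series(["red", "blue", "red"], dtype="category")
s = s.cat.add_categories(["green"])  # 'green' is allowed but never used

# Safe: drops only labels that no value uses.
trimmed = s.cat.remove_unused_categories()
print(list(trimmed.cat.categories))  # ['blue', 'red']

# Risky: removing a label that is in use turns those values into NaN.
removed = s.cat.remove_categories(["blue"])
print(removed.isna().tolist())         # [False, True, False]
print(sorted(removed.cat.categories))  # ['green', 'red']
```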

5

Intermediate: Using set_categories to redefine categories
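
A minimal sketch of set_categories with its default rename=False, assuming an illustrative Series. The new list replaces the old one in a single call, and any value whose label is not in the new list becomes NaN.

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c"], dtype="category")

# Redefine the full category list: drop 'b', keep 'a' and 'c', add 'd'.
s2 = s.cat.set_categories(["a", "c", "d"])
print(list(s2.cat.categories))  # ['a', 'c', 'd']
print(s2.isna().tolist())       # [False, True, False, False]
```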

6

Advanced: Handling categories in DataFrames
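
A minimal sketch of category changes inside a DataFrame, assuming two illustrative categorical columns. Changes go through the .cat accessor of one column and are assigned back, so the other column's categories stay as they were.

```python
import pandas as pd

df = pd.DataFrame({
    "color": pd.Categorical(["red", "blue", "red"]),
    "size": pd.Categorical(["S", "M", "S"]),
})

# Category changes are made per column and assigned back to that column.
df["color"] = df["color"].cat.add_categories(["green"])

print(list(df["color"].cat.categories))  # ['blue', 'red', 'green']
print(list(df["size"].cat.categories))   # ['M', 'S']
```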

7

Expert: Performance and memory impact of category changes
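
A rough sketch for measuring the memory effect, assuming a workload of three labels repeated many times (the sizes are illustrative, and the exact byte counts depend on the pandas version and string backend):

```python
import pandas as pd

# Assumed workload: three distinct labels repeated many times.
labels = pd.Series(["red", "blue", "green"] * 100_000)
as_cat = labels.astype("category")

str_bytes = labels.memory_usage(deep=True)
cat_bytes = as_cat.memory_usage(deep=True)
print(str_bytes, cat_bytes)

# Adding a category grows the small category index; the per-row codes
# keep their compact integer dtype.
bigger = as_cat.cat.add_categories(["yellow"])
print(bigger.cat.codes.dtype)  # int8
```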

Under the Hood

Pandas categorical data stores data as integer codes pointing to a separate list of categories. Adding categories appends new labels to this list without changing codes. Removing categories deletes labels from the list, replaces codes that pointed to removed categories with the missing-value code -1 (shown as NaN in the data), and renumbers the codes of the remaining labels. This separation allows efficient storage and fast operations.

Why designed this way?

This design balances memory efficiency and speed. Storing repeated values as codes saves space. Separating categories allows quick updates to allowed labels without rewriting all data. Alternatives like storing strings directly are slower and use more memory. The tradeoff is complexity in managing category lists and codes.

Data values (codes):
[0, 1, 0, 2]

Categories list:
┌─────┬─────┬─────┐
│ 'a' │ 'b' │ 'c' │
└─────┴─────┴─────┘

Add category 'd':
Categories list:
┌─────┬─────┬─────┬─────┐
│ 'a' │ 'b' │ 'c' │ 'd' │
└─────┴─────┴─────┴─────┘

Remove category 'b':
Categories list:
┌─────┬─────┐
│ 'a' │ 'c' │
└─────┴─────┘

Data codes updated (-1 marks a missing value, shown as NaN in the data):
[0, -1, 0, 1]
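
The code-and-category mechanics in this section can be checked directly. This is a minimal sketch using pd.Categorical with the same 'a', 'b', 'c' labels; pandas stores a missing value as code -1.

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "a", "c"])
print(cat.codes.tolist())  # [0, 1, 0, 2]

# Adding a label appends to the category list; codes are unchanged.
added = cat.add_categories(["d"])
print(added.codes.tolist())  # [0, 1, 0, 2]

# Removing 'b' renumbers the remaining labels and marks former 'b' values with -1.
removed = added.remove_categories(["b"])
print(list(removed.categories))  # ['a', 'c', 'd']
print(removed.codes.tolist())    # [0, -1, 0, 1]
```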

Myth Busters - 4 Common Misconceptions

Quick: Does adding a category change existing data values? Commit yes or no.

Common Belief: Adding a category changes data values to use the new category automatically.

Reality: Adding a category only expands the allowed labels; existing data values stay exactly as they were.

Quick: If you remove a category still used in data, does data stay the same? Commit yes or no.

Common Belief: Removing a category does not affect data values that use it.

Reality: Removing a category converts every data value that used it into a missing value (NaN).

Quick: Does set_categories always rename categories without changing data? Commit yes or no.

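Common Belief: set_categories only renames categories and never affects data values.

Reality: set_categories redefines the whole category list, and values whose labels are left out of the new list become NaN, so it can cause data loss.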

Quick: Do categories in one DataFrame column affect others? Commit yes or no.

Common Belief: Categories are global across a DataFrame and changing one column changes all.

Reality: Categories are specific to each column; changing the categories of one column leaves the others untouched.

Expert Zone

1

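Adding categories leaves the integer codes as they are and only extends the category list. Removing categories requires recomputing codes, because values that used a removed label get the missing-value code and the remaining labels may be renumbered.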

2

Categories can be ordered or unordered. add_categories always appends new labels at the end, so on an ordered categorical a newly added label ranks highest until you reorder the categories explicitly.

3

When categories are removed, pandas renumbers the codes of the remaining categories, so the same label can have a different code afterward. This can lead to subtle bugs if codes are inspected or stored directly.
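
A minimal sketch of the ordered case, assuming an illustrative clothing-size column (the labels and variable names are hypothetical). It shows a newly added label landing at the top of the order and then being moved with reorder_categories.

```python
import pandas as pd

sizes = pd.Series(
    pd.Categorical(["S", "L", "M"], categories=["S", "M", "L"], ordered=True)
)

# add_categories appends at the end, so on an ordered categorical
# the new label becomes the largest.
sizes = sizes.cat.add_categories(["XS"])
print(list(sizes.cat.categories))  # ['S', 'M', 'L', 'XS']

# Restore the intended order explicitly.
sizes = sizes.cat.reorder_categories(["XS", "S", "M", "L"], ordered=True)
print(sizes.min(), sizes.max())    # S L
```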

When NOT to use

Avoid categorical data when the set of labels changes constantly, as in streaming or real-time data; use object/string types instead. Also, do not remove categories that are still in use if you need to preserve all original data values. For very large datasets with many unique values, consider other encoding methods like hashing.

Production Patterns

In production, categories are often fixed after data cleaning to ensure consistency. Adding categories is used when new data arrives with unseen labels. Removing categories is used to drop rare or irrelevant labels before modeling. set_categories is used to reorder or rename categories for reporting and visualization.
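
One way the "drop rare labels before modeling" pattern might look, as a sketch only. The threshold of 2 and the 'other' replacement label are assumptions for illustration, not part of the original text.

```python
import pandas as pd

s = pd.Series(["a", "a", "a", "b", "b", "z"], dtype="category")

# Hypothetical threshold: labels seen fewer than 2 times count as rare.
counts = s.value_counts()
rare = counts[counts < 2].index.tolist()

# Register the replacement label first, then drop the rare ones and fill the gaps.
s = s.cat.add_categories(["other"])
s = s.cat.remove_categories(rare).fillna("other")
print(s.tolist())  # ['a', 'a', 'a', 'b', 'b', 'other']
```

Registering 'other' before filling matters because a categorical only accepts fill values that are already among its categories.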

Connections

Database normalization

Both manage repeated values by referencing a fixed set of labels to save space and maintain consistency.

Understanding how databases use foreign keys to represent categories helps grasp why pandas separates data codes from category labels.

Enum types in programming languages

Categories in pandas are similar to enums, which define a fixed set of named constants.

Knowing enums clarifies why categories have fixed allowed values and how adding/removing categories is like changing enum definitions.

Taxonomy in biology

Both involve organizing items into fixed groups or categories that can be updated as knowledge changes.

Seeing categories as taxonomic groups helps understand the importance of adding/removing categories carefully to reflect true data structure.

Common Pitfalls

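#1 Removing a category that is still used in data without handling missing values.

Wrong approach:
    cat = cat.remove_categories(['used_category'])
    print(cat)  # values that were 'used_category' are now NaN

Correct approach:
    cat = cat.add_categories(['replacement'])  # the fill value must be a category; skip if it already exists
    cat = cat.remove_categories(['used_category'])
    cat = cat.fillna('replacement')
    print(cat)

Root cause: Not realizing that removing categories replaces data values with NaN, which must be handled explicitly, and that a categorical can only be filled with a value that is one of its categories.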

#2 Adding categories and expecting data to update automatically to use new categories.

Wrong approach:
    cat = cat.add_categories(['new_cat'])
    print(cat)  # expecting new_cat in data

Correct approach:
    from pandas.api.types import union_categoricals
    cat = cat.add_categories(['new_cat'])
    extra = pd.Categorical(['new_cat'], categories=cat.categories)
    cat = union_categoricals([cat, extra])  # a Categorical has no append method
    print(cat)

Root cause: Confusing allowed categories with actual data values.

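#3 Using set_categories with the default rename=False and losing data unintentionally.

Wrong approach (assuming cat has categories ['a', 'b', 'c']):
    cat2 = cat.set_categories(['a', 'c'])
    print(cat2)  # values that were 'b' are now NaN

Correct approach:
    cat2 = cat.set_categories(['a', 'b', 'c'])  # list every label that is still in use
    print(cat2)

Root cause: Not realizing that set_categories turns any value whose label is missing from the new list into NaN. Passing rename=True does not prevent this: it relabels categories by position, so with ['a', 'c'] the old 'b' values would be shown as 'c' and the old 'c' values would become NaN.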

Key Takeaways

Categories in pandas are fixed sets of labels that represent data values efficiently.

Adding categories expands the allowed labels without changing existing data values.

Removing categories deletes labels and converts data values using them into missing values.

set_categories can redefine categories but must be used carefully to avoid data loss.

Categories are specific to each column and managing them properly is key for accurate and efficient data analysis.