
Adding and removing categories in Pandas - Deep Dive

Overview - Adding and removing categories
What is it?
Adding and removing categories is about changing the list of possible values in a pandas categorical data type. Categories are like labels that a column can have, and you can add new labels or remove old ones. This helps keep data organized and efficient when working with repeated values. It is useful when your data changes or you want to clean it up.
Why it matters
Without the ability to add or remove categories, you might have categories that don't match your data or miss new categories that appear. This can cause errors or waste memory. Being able to update categories keeps your data accurate and your programs faster. It also helps when you want to analyze or visualize data with the right groups.
Where it fits
Before learning this, you should understand what categorical data is in pandas and how to create it. After this, you can learn about category ordering, category renaming, and how categories affect grouping and plotting.
Mental Model
Core Idea
Categories in pandas are like a fixed set of labels for data. Adding or removing categories changes which labels are allowed; existing values stay as they are unless their own label is removed, in which case they become missing.
Think of it like...
Imagine a box of colored pencils where each color is a category. Adding a category is like adding a new color pencil to the box, and removing a category is like taking a color pencil out. The drawings you made with the pencils don’t change, but the colors you can use in the future do.
Categories:
┌───────────────┐
│ Red           │
│ Blue          │
│ Green         │
└───────────────┘

Data values:
[Red, Blue, Red, Green, Blue]

Add category 'Yellow':
┌───────────────┐
│ Red           │
│ Blue          │
│ Green         │
│ Yellow        │
└───────────────┘

Remove category 'Green':
┌───────────────┐
│ Red           │
│ Blue          │
│ Yellow        │
└───────────────┘
Build-Up - 7 Steps
1. Foundation: Understanding pandas categorical data
Concept: Learn what categorical data is and why pandas uses it.
Categorical data in pandas stores values that come from a fixed set of categories. When the same values repeat many times, this saves memory and speeds up operations compared to storing strings. You create a categorical column by converting a normal column using pd.Categorical or .astype('category').
Result
You get a column with categories and codes representing the data values.
Understanding categorical data is key because adding or removing categories only makes sense when you know what categories are and how they represent data.
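As a minimal sketch of the idea above, the snippet below converts a small, made-up column of strings and shows the two parts pandas keeps: the list of categories and the integer codes. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample data: a column with a few repeated string labels.
s = pd.Series(["red", "blue", "red", "green", "blue"])

# Convert the column to the categorical dtype.
s_cat = s.astype("category")

# The distinct labels; categories inferred from strings come out sorted.
labels = list(s_cat.cat.categories)

# One small integer per row, pointing into the categories list.
codes = s_cat.cat.codes.tolist()

print(labels)  # ['blue', 'green', 'red']
print(codes)   # [2, 0, 2, 1, 0]
```

Each row stores only a code, so a long column with few distinct labels needs far less space than the same column of strings.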
2. Foundation: Creating and viewing categories
Concept: How to create a categorical column and see its categories.
Use pd.Categorical(['a', 'b', 'a', 'c']) to create categorical data. Use the .categories attribute to see the list of categories. Example:

    import pandas as pd

    cat = pd.Categorical(['a', 'b', 'a', 'c'])
    print(cat.categories)
    # Output: Index(['a', 'b', 'c'], dtype='object')
Result
You see the categories: 'a', 'b', and 'c'.
Knowing how to check categories helps you understand what labels your data currently uses before changing them.
3. Intermediate: Adding new categories safely
Before reading on: do you think adding a category changes existing data values? Commit to your answer.
Concept: You can add new categories without changing existing data values using .add_categories().
If you have a categorical column and want to allow new labels, use .add_categories(['new_label']). This adds the label to the allowed categories but does not change any data values. Example:

    import pandas as pd

    cat = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b'])
    cat = cat.add_categories(['c'])
    print(cat.categories)
    # Output: Index(['a', 'b', 'c'], dtype='object')
Result
The categories now include 'c', but data values remain 'a' and 'b'.
Understanding that adding categories only changes the allowed labels, not the data, prevents confusion and errors when updating categories.
4. Intermediate: Removing categories carefully
Before reading on: what happens if you remove a category that is still used in data? Commit to your answer.
Concept: Removing categories with .remove_categories() deletes labels from allowed categories but can make data values invalid or missing.
If you remove a category that is still used in data, those data points become NaN (missing). Example:

    import pandas as pd

    cat = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c'])
    cat = cat.remove_categories(['b'])
    print(cat)
    # Output: [a, NaN, a]
    print(cat.categories)
    # Output: Index(['a', 'c'], dtype='object')
Result
Data values that used 'b' become missing because 'b' is no longer a category.
Knowing that removing categories can cause missing data helps you avoid accidental data loss.
5. Intermediate: Using set_categories to redefine categories
Before reading on: does set_categories change data values or just categories? Commit to your answer.
Concept: set_categories replaces the entire list of categories and can keep or drop data values not in the new list.
You can use .set_categories(new_list, rename=False) to replace categories. With rename=False (the default), values are matched by label, and data values whose label is not in the new list become NaN. With rename=True, the new list relabels the existing categories by position: the underlying codes stay the same, but the labels they point to change. Older pandas versions also accepted an inplace argument, which has since been removed. Example:

    import pandas as pd

    cat = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c'])
    cat2 = cat.set_categories(['a', 'c'])
    print(cat2)
    # Output: [a, NaN, a]
    print(cat2.categories)
    # Output: Index(['a', 'c'], dtype='object')
Result
Categories are replaced; data values not in new categories become missing.
Understanding set_categories lets you redefine categories flexibly but requires care to avoid data loss.
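To make the effect of the rename parameter concrete, here is a small sketch that applies set_categories to the same data twice, once matching by label and once relabeling by position. The labels 'x', 'y', and 'z' are made up for illustration.

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"])

# rename=False (the default): labels are matched by value,
# so the 'b' entry has no home in the new list and becomes missing.
matched = cat.set_categories(["a", "c"])

# rename=True: the new list relabels the old categories by position,
# so old 'a' is now 'x' and old 'b' is now 'y'. No value is dropped.
relabeled = cat.set_categories(["x", "y", "z"], rename=True)

print(matched.isna().tolist())  # [False, True, False]
print(list(relabeled))          # ['x', 'y', 'x']
```

Because rename=True works by position, it only makes sense when the new list is meant as new names for the old categories in the same order.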
6. Advanced: Handling categories in DataFrames
Before reading on: do you think adding categories to one column affects others? Commit to your answer.
Concept: Categories are per column; adding or removing categories affects only that column's categorical data.
In a DataFrame, each categorical column has its own categories. You can add or remove categories on one column without changing others. Note that add_categories appends the new label at the end of the existing list rather than re-sorting it. Example:

    import pandas as pd

    df = pd.DataFrame({
        'color': pd.Categorical(['red', 'blue']),
        'shape': pd.Categorical(['circle', 'square']),
    })
    df['color'] = df['color'].cat.add_categories(['green'])
    print(df['color'].cat.categories)
    # Output: Index(['blue', 'red', 'green'], dtype='object')
    print(df['shape'].cat.categories)
    # Output: Index(['circle', 'square'], dtype='object')
Result
Only the 'color' column categories changed; 'shape' categories stayed the same.
Knowing categories are column-specific helps manage complex datasets with multiple categorical columns.
7. Expert: Performance and memory impact of category changes
Before reading on: does adding categories increase memory usage immediately or only when data uses them? Commit to your answer.
Concept: Adding categories grows the category list but leaves the per-row codes the same size, unless the number of categories outgrows the integer type used for the codes; removing categories shrinks the list but may cause missing data.
Categories are stored separately from the data, which is held as integer codes. Adding categories lengthens the category list, which uses some memory, but the codes keep their size as long as the category count still fits their integer type. Removing categories shortens the list; values that used a removed category are stored as the missing-value code -1, so the codes array does not grow. Example:

    import pandas as pd

    cat = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b'])
    print(cat.memory_usage())
    cat = cat.add_categories(['c'])
    print(cat.memory_usage())
    cat = cat.remove_categories(['b'])
    print(cat.memory_usage())
Result
Memory usage changes slightly with the size of the category list; the memory used by the codes stays stable unless the number of categories crosses an integer-width boundary.
Understanding memory impact helps optimize large datasets and avoid surprises in performance.
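The one case where adding categories does change the size of the data itself is when the category count no longer fits the integer type of the codes. The sketch below illustrates this with a deliberately large number of made-up labels; the exact thresholds are an implementation detail of pandas, so the example stays well away from them.

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "a"], categories=["a", "b"])

# With only a couple of categories, each code fits in a single byte.
small_dtype = cat.codes.dtype

# Add 1000 hypothetical, unused labels. None of them appears in the data.
many = cat.add_categories([f"label_{i}" for i in range(1000)])

# The codes now need a wider integer type, even though the values are unchanged.
large_dtype = many.codes.dtype

print(small_dtype, large_dtype)
print(list(many))  # ['a', 'b', 'a']
```

So the width of the codes depends on how many categories are allowed, not on how many of them the data actually uses.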
Under the Hood
Pandas categorical data stores data as integer codes pointing to a separate list of categories. Adding categories appends new labels to this list without changing codes. Removing categories deletes labels from the list and replaces codes pointing to removed categories with a special missing value code. This separation allows efficient storage and fast operations.
Why designed this way?
This design balances memory efficiency and speed. Storing repeated values as codes saves space. Separating categories allows quick updates to allowed labels without rewriting all data. Alternatives like storing strings directly are slower and use more memory. The tradeoff is complexity in managing category lists and codes.
Data values (codes):
[0, 1, 0, 2]

Categories list:
┌─────┬─────┬─────┐
│ 'a' │ 'b' │ 'c' │
└─────┴─────┴─────┘

Add category 'd':
Categories list:
┌─────┬─────┬─────┬─────┐
│ 'a' │ 'b' │ 'c' │ 'd' │
└─────┴─────┴─────┴─────┘

Remove category 'b':
Categories list:
┌─────┬─────┐
│ 'a' │ 'c' │
└─────┴─────┘

Data codes updated (-1 marks a missing value):
[0, -1, 0, 1]
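The diagram can be reproduced by inspecting the .codes attribute directly. This is a small sketch using the same labels; note that in the codes array a missing value appears as -1, and that the surviving categories are renumbered after a removal.

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "a", "c"], categories=["a", "b", "c"])
codes_before = cat.codes.tolist()

# Adding a category appends to the list; existing codes are untouched.
added = cat.add_categories(["d"])
codes_added = added.codes.tolist()

# Removing 'b' marks its rows as missing (-1) and shifts 'c' from 2 to 1.
removed = added.remove_categories(["b"])
codes_removed = removed.codes.tolist()

print(codes_before)   # [0, 1, 0, 2]
print(codes_added)    # [0, 1, 0, 2]
print(codes_removed)  # [0, -1, 0, 1]
```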
Myth Busters - 4 Common Misconceptions
Quick: Does adding a category change existing data values? Commit yes or no.
Common Belief: Adding a category changes data values to use the new category automatically.
Reality: Adding categories only changes the list of allowed categories; existing data values stay the same.
Why it matters: Believing this causes confusion and bugs when data does not update as expected after adding categories.
Quick: If you remove a category still used in data, does data stay the same? Commit yes or no.
Common Belief: Removing a category does not affect data values that use it.
Reality: Removing a category used in data replaces those data values with missing (NaN).
Why it matters: Ignoring this can cause unexpected missing data and analysis errors.
Quick: Does set_categories always rename categories without changing data? Commit yes or no.
Common Belief: set_categories only renames categories and never affects data values.
Reality: With the default rename=False, set_categories turns data values that are not in the new list into missing values. With rename=True it relabels the existing categories by position instead, which changes what the values are called.
Why it matters: Misusing set_categories can cause silent data loss or silently mislabeled data.
Quick: Do categories in one DataFrame column affect others? Commit yes or no.
Common Belief: Categories are global across a DataFrame and changing one column changes all.
Reality: Categories are specific to each column; changing one column's categories does not affect others.
Why it matters: Assuming global categories leads to incorrect data handling in multi-column datasets.
Expert Zone
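1. add_categories and remove_categories both return a new Categorical rather than modifying the original; removing categories also recodes the values, with rows that used a removed label becoming missing.
2. Categories can be ordered or unordered. In an ordered categorical, add_categories appends the new labels at the end, so they rank above every existing category until you reorder them.
3. When categories are removed, pandas renumbers the codes of the remaining categories, so the same label can map to a different integer before and after the change. This can lead to subtle bugs if codes are stored or inspected directly.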
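The interaction between ordering and adding categories is easy to miss, so here is a brief sketch with made-up size labels. It uses reorder_categories, a pandas method not covered in the text above, to put the new label in its intended position.

```python
import pandas as pd

sizes = pd.Categorical(
    ["small", "large", "small"],
    categories=["small", "large"],
    ordered=True,
)

# The new label is appended at the end: small < large < medium.
with_medium = sizes.add_categories(["medium"])
appended_order = list(with_medium.categories)

# Put 'medium' where it belongs; the data values are not changed.
fixed = with_medium.reorder_categories(["small", "medium", "large"])
fixed_order = list(fixed.categories)

print(appended_order)  # ['small', 'large', 'medium']
print(fixed_order)     # ['small', 'medium', 'large']
```

Without the reorder step, any comparison or sort would treat 'medium' as larger than 'large'.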
When NOT to use
Avoid using categorical data with frequent category changes in streaming or real-time data; use object/string types instead. Also, do not remove categories if you need to preserve all original data values. For very large datasets with many unique values, consider other encoding methods like hashing.
Production Patterns
In production, categories are often fixed after data cleaning to ensure consistency. Adding categories is used when new data arrives with unseen labels. Removing categories is used to drop rare or irrelevant labels before modeling. set_categories is used to reorder or rename categories for reporting and visualization.
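As one possible illustration of the "drop rare labels" pattern, the sketch below counts how often each label occurs and removes those under a threshold. The data and the min_count threshold are hypothetical; a real pipeline would pick its own cutoff.

```python
import pandas as pd

# Hypothetical data: 'emu' occurs only once.
s = pd.Series(["cat", "dog", "cat", "emu", "dog", "cat"], dtype="category")
min_count = 2  # assumed threshold, chosen for illustration

# Count occurrences per label and collect the rare ones.
counts = s.value_counts()
rare = counts[counts < min_count].index.tolist()

# Removing the rare categories turns their rows into missing values.
cleaned = s.cat.remove_categories(rare)

print(rare)                          # ['emu']
print(list(cleaned.cat.categories))  # ['cat', 'dog']
print(int(cleaned.isna().sum()))     # 1
```

The rows that held a rare label are now NaN, so they should be dropped or filled before modeling.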
Connections
Database normalization
Both manage repeated values by referencing a fixed set of labels to save space and maintain consistency.
Understanding how databases use foreign keys to represent categories helps grasp why pandas separates data codes from category labels.
Enum types in programming languages
Categories in pandas are similar to enums, which define a fixed set of named constants.
Knowing enums clarifies why categories have fixed allowed values and how adding/removing categories is like changing enum definitions.
Taxonomy in biology
Both involve organizing items into fixed groups or categories that can be updated as knowledge changes.
Seeing categories as taxonomic groups helps understand the importance of adding/removing categories carefully to reflect true data structure.
Common Pitfalls
#1 Removing a category that is still used in data without handling missing values.
Wrong approach:

    cat = cat.remove_categories(['used_category'])
    print(cat)

Correct approach (the replacement label must itself be an allowed category before it can be used as a fill value):

    cat = cat.remove_categories(['used_category'])
    cat = cat.add_categories(['replacement'])
    cat = cat.fillna('replacement')
    print(cat)

Root cause: Not realizing that removing categories replaces data values with NaN, which must be handled explicitly.
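An alternative for the pitfall above is to move the affected values to a replacement label before removing the category, which leaves any pre-existing missing values untouched. This is only a sketch; the label 'other' is a made-up placeholder.

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"])

# Check first whether the label is still used by any row.
in_use = bool((cat == "b").any())

# Allow the placeholder label, move the affected rows, then remove 'b'.
safe = cat.add_categories(["other"])
safe[safe == "b"] = "other"
safe = safe.remove_categories(["b"])

print(in_use)      # True
print(list(safe))  # ['a', 'other', 'a']
```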
#2 Adding categories and expecting data to update automatically to use new categories.
Wrong approach:

    cat = cat.add_categories(['new_cat'])
    print(cat)  # expecting new_cat in data

Correct approach (a Categorical has no append method, so the new value is combined in with union_categoricals):

    cat = cat.add_categories(['new_cat'])
    extra = pd.Categorical(['new_cat'], categories=cat.categories)
    cat = pd.api.types.union_categoricals([cat, extra])
    print(cat)

Root cause: Confusing allowed categories with actual data values.
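The difference between allowed categories and actual values also shows up when assigning: a label has to be allowed before it can be stored. The following sketch demonstrates this on a small Categorical; the exception type has varied across pandas versions, so both likely types are caught.

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "a"], categories=["a", "b"])

# Assigning a label that is not an allowed category is rejected.
try:
    cat[0] = "new_cat"
    rejected = False
except (TypeError, ValueError):
    rejected = True

# After adding the category, the same assignment works.
cat = cat.add_categories(["new_cat"])
cat[0] = "new_cat"

print(rejected)   # True
print(list(cat))  # ['new_cat', 'b', 'a']
```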
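#3 Using set_categories with rename=False and losing data unintentionally.
Wrong approach (assuming cat holds ['a', 'b', 'a'] with categories ['a', 'b', 'c'], this turns 'b' into NaN):

    cat2 = cat.set_categories(['a', 'c'])
    print(cat2)

Correct approach (keep every label that still appears in the data in the new list):

    cat2 = cat.set_categories(['a', 'b'])
    print(cat2)

Root cause: Not realizing that with rename=False labels are matched by value, so any value missing from the new list becomes NaN. Passing rename=True is not a fix: it relabels categories by position, which here would silently turn 'b' into 'c'.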
Key Takeaways
Categories in pandas are fixed sets of labels that represent data values efficiently.
Adding categories expands the allowed labels without changing existing data values.
Removing categories deletes labels and converts data values using them into missing values.
set_categories can redefine categories but must be used carefully to avoid data loss.
Categories are specific to each column and managing them properly is key for accurate and efficient data analysis.