
Category codes and labels in Pandas - Deep Dive

Overview - Category codes and labels
What is it?
Category codes and labels in pandas are a way to represent categorical data efficiently. Instead of storing repeated text values, pandas stores integer codes that point to unique category labels. This saves memory and speeds up operations on data with many repeated values. It is especially useful for columns with a limited set of possible values.
Why it matters
Without category codes and labels, data with repeated text values takes more memory and is slower to process. For example, a column with thousands of 'Yes' or 'No' entries wastes space storing the same words repeatedly. Using categories reduces memory use and speeds up filtering, grouping, and sorting. This makes data analysis faster and more efficient, especially with large datasets.
Where it fits
Before learning category codes and labels, you should understand basic pandas DataFrames and data types. After this, you can learn about advanced data manipulation, memory optimization, and performance tuning in pandas. This concept also connects to data cleaning and preparation steps in data science workflows.
Mental Model
Core Idea
Category codes are numbers that point to unique labels, letting pandas store repeated values efficiently by referencing instead of repeating.
Think of it like...
Imagine a classroom where each student has a unique ID number, but their names are stored only once on a list. Instead of writing the full name every time, you just write the ID number. This saves space and makes it faster to find students by their ID.
Categories:
┌─────────────┐
│ Labels List │
│ 0: 'Yes'   │
│ 1: 'No'    │
│ 2: 'Maybe' │
└─────────────┘

Data column:
[0, 1, 0, 2, 1, 0]

Each number is a code pointing to a label.
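The diagram above can be reproduced directly in pandas. This is a minimal sketch that assumes the label list and codes shown in the diagram and uses `pd.Categorical.from_codes` to combine them; the variable names are illustrative.

```python
import pandas as pd

# The label list from the diagram: position 0 -> 'Yes', 1 -> 'No', 2 -> 'Maybe'.
labels = ['Yes', 'No', 'Maybe']

# The data column from the diagram, stored as integer codes.
codes = [0, 1, 0, 2, 1, 0]

# Build a categorical Series directly from the codes and the label list.
answers = pd.Series(pd.Categorical.from_codes(codes, categories=labels))

print(answers.tolist())            # ['Yes', 'No', 'Yes', 'Maybe', 'No', 'Yes']
print(answers.cat.codes.tolist())  # [0, 1, 0, 2, 1, 0]
```

Each label is stored once, while the column itself holds only the small integers.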
Build-Up - 7 Steps
1
Foundation: Understanding categorical data basics
Concept: Categorical data means data that can take only a limited set of values, like colors or yes/no answers.
In pandas, categorical data is stored differently from normal text data. Instead of repeating the same words many times, pandas can store a list of unique categories and then use codes to represent each value. This is like using a shortcut to save space.
Result
You learn that categorical data is a special type that helps save memory and speed up operations.
Understanding that some data has limited possible values helps you see why special storage methods like categories are useful.
2
Foundation: Creating categorical columns in pandas
Concept: You can convert a normal text column to a categorical type in pandas using the .astype('category') method.
Example:
import pandas as pd

colors = pd.Series(['red', 'blue', 'red', 'green', 'blue'])
colors_cat = colors.astype('category')  # convert text to categorical

print(colors_cat)                 # the categorical column
print(colors_cat.cat.categories)  # unique labels
print(colors_cat.cat.codes)       # integer codes
Result
The output shows the categorical column, the list of unique categories ['blue', 'green', 'red'], and the integer codes [2, 0, 2, 1, 0].
Knowing how to create categorical columns is the first step to using category codes and labels effectively.
3
Intermediate: Accessing category codes and labels
Concept: You can access the integer codes with .cat.codes and the labels with .cat.categories on a categorical Series.
Example:
import pandas as pd

fruits = pd.Series(['apple', 'banana', 'apple', 'orange'])
fruits_cat = fruits.astype('category')

codes = fruits_cat.cat.codes        # one integer per row
labels = fruits_cat.cat.categories  # unique labels

print('Codes:', codes.tolist())
print('Labels:', labels.tolist())
Result
Codes: [0, 1, 0, 2]
Labels: ['apple', 'banana', 'orange']
Separating codes and labels lets you work with efficient numbers internally while keeping meaningful names for humans.
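To make the link between codes and labels concrete, the sketch below rebuilds the original values by looking up each code in the label array. It assumes a column with no missing values, since a code of -1 would index from the end of the array under NumPy-style indexing.

```python
import pandas as pd

fruits_cat = pd.Series(['apple', 'banana', 'apple', 'orange']).astype('category')

codes = fruits_cat.cat.codes.to_numpy()          # integer array: 0, 1, 0, 2
labels = fruits_cat.cat.categories.to_numpy()    # array of unique labels

# Look up each code in the label array to rebuild the original values.
decoded = labels[codes].tolist()
print(decoded)  # ['apple', 'banana', 'apple', 'orange']

# A plain dict view of the code -> label mapping can help when debugging.
code_to_label = dict(enumerate(labels.tolist()))
print(code_to_label)  # {0: 'apple', 1: 'banana', 2: 'orange'}
```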
4
Intermediate: Customizing category order and labels
Before reading on: Do you think category labels are always sorted alphabetically by default? Commit to your answer.
Concept: You can define your own order of categories and labels instead of the default alphabetical order.
Example:
import pandas as pd

sizes = pd.Series(['small', 'large', 'medium', 'small'])
# Declare the categories in their logical order and mark them as ordered
cat_type = pd.CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
sizes_cat = sizes.astype(cat_type)

print('Categories:', sizes_cat.cat.categories.tolist())
print('Codes:', sizes_cat.cat.codes.tolist())
Result
Categories: ['small', 'medium', 'large']
Codes: [0, 2, 1, 0]
Knowing you can control category order is important for meaningful sorting and comparisons.
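A short sketch of what an ordered category type makes possible, reusing the sizes from the example above: sorting, comparisons against a label, and `max()` all follow the declared order rather than alphabetical order.

```python
import pandas as pd

size_type = pd.CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
sizes_cat = pd.Series(['small', 'large', 'medium', 'small']).astype(size_type)

# Sorting follows the declared category order, not alphabetical order.
print(sizes_cat.sort_values().tolist())  # ['small', 'small', 'medium', 'large']

# Ordered categories support comparisons against a label.
at_least_medium = sizes_cat >= 'medium'
print(at_least_medium.tolist())  # [False, True, True, False]

# min/max are defined for ordered categoricals.
print(sizes_cat.max())  # large
```

Alphabetical sorting would have placed 'large' first, which is rarely what a size column needs.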
5
Intermediate: Memory and performance benefits of categories
Before reading on: Do you think converting text columns to categories always reduces memory usage? Commit to your answer.
Concept: Using categories reduces memory when values repeat often, but it can use more memory when most values are unique.
Example:
import pandas as pd

texts = pd.Series(['word'] * 10000)
texts_cat = texts.astype('category')

# memory_usage(deep=True) reports the full size, including the string data
print('Original size:', texts.memory_usage(deep=True))
print('Categorical size:', texts_cat.memory_usage(deep=True))
Result
Original size is much larger than categorical size, showing memory savings.
Understanding when categories save memory helps you decide when to use them.
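The sketch below contrasts a low-cardinality column with a high-cardinality one. It assumes object-dtype strings (set explicitly with `dtype=object`) and uses a hypothetical `nbytes` helper around `memory_usage(deep=True)`. Exact byte counts depend on the pandas version and platform, so none are shown.

```python
import pandas as pd

n = 10_000

# Low cardinality: two labels repeated many times.
repeated = pd.Series(['Yes', 'No'] * (n // 2), dtype=object)
# High cardinality: every value is distinct.
unique = pd.Series([f'id_{i}' for i in range(n)], dtype=object)


def nbytes(s):
    """Hypothetical helper: total bytes, including the string data."""
    return s.memory_usage(deep=True)


print('repeated:', nbytes(repeated), '->', nbytes(repeated.astype('category')))
print('unique:  ', nbytes(unique), '->', nbytes(unique.astype('category')))
```

The repeated column should shrink dramatically. The unique column gains little or nothing, because every label must still be stored once in addition to the codes.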
6
Advanced: Handling missing values in categories
Before reading on: Do you think missing values get their own category code? Commit to your answer.
Concept: Missing values (NaN) in categorical data have a special code of -1 and are not part of the category labels.
Example:
import pandas as pd

data = pd.Series(['cat', 'dog', None, 'cat'])
cat_data = data.astype('category')

print('Codes:', cat_data.cat.codes.tolist())            # the missing value appears as -1
print('Categories:', cat_data.cat.categories.tolist())  # missing is not a category
Result
Codes: [0, 1, -1, 0]
Categories: ['cat', 'dog']
Knowing how missing data is coded prevents bugs when analyzing or transforming categorical data.
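A minimal sketch of working with missing values in a categorical column. The 'unknown' label is an illustrative choice; the point is that a fill value has to be registered as a category before it can be used.

```python
import pandas as pd

cat_data = pd.Series(['cat', 'dog', None, 'cat']).astype('category')

# Detect missing entries with isna() rather than testing codes by hand.
print(cat_data.isna().tolist())             # [False, False, True, False]
print((cat_data.cat.codes == -1).tolist())  # [False, False, True, False]

# To fill missing values with a new label, register the label first;
# filling with a value that is not a category raises an error.
filled = cat_data.cat.add_categories('unknown').fillna('unknown')
print(filled.tolist())            # ['cat', 'dog', 'unknown', 'cat']
print(filled.cat.codes.tolist())  # [0, 1, 2, 0]
```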
7
Expert: Category internals and performance surprises
Before reading on: Do you think category codes are always stored as int8? Commit to your answer.
Concept: Category codes use the smallest integer type needed, but operations can cause unexpected type promotions affecting performance.
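Internally, pandas picks the smallest signed integer type (int8, int16, and so on) that can index every category, so a column with a handful of labels gets int8 codes. Adding categories beyond what the current type can index promotes the codes to a wider type and increases memory use. Merging or concatenating categorical columns whose category sets differ can also fall back to object dtype, which silently discards the savings. Comparisons between categoricals are typically faster than string comparisons because pandas compares integer codes rather than text.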
Result
Understanding this helps optimize memory and speed in large data pipelines.
Knowing the internal integer type behavior helps avoid hidden performance costs in complex workflows.
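The type promotion described in this step can be observed by inspecting `cat.codes.dtype`. This sketch uses made-up labels and a count of 200 categories, chosen only because it exceeds the largest value int8 can hold.

```python
import pandas as pd

few = pd.Series(['a', 'b', 'c']).astype('category')
print(few.cat.codes.dtype)   # int8

# 200 distinct labels no longer fit in int8 (max value 127), so int16 is used.
many = pd.Series([f'label_{i}' for i in range(200)]).astype('category')
print(many.cat.codes.dtype)  # int16

# Adding categories can push an existing column past the int8 limit too.
extra = [f'extra_{i}' for i in range(200)]
grown = few.cat.add_categories(extra)
print(grown.cat.codes.dtype)  # int16
```

The values in `grown` are unchanged; only the width of each stored code has doubled.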
Under the Hood
Pandas stores categorical data as two parts: a list of unique category labels and an integer array of codes. Each code is an index pointing to a label. This means repeated values are stored once, and the data column stores only integers. When you access the data, pandas replaces codes with labels for display. Missing values get a special code -1. Internally, pandas uses the smallest integer type that fits all codes to save memory.
Why designed this way?
This design was chosen to optimize memory and speed for data with repeated values. Storing repeated strings wastes memory and slows down operations like sorting and grouping. Using codes allows fast integer operations and less memory use. Alternatives like storing strings directly or using dictionaries for mapping were less efficient or more complex. This approach balances simplicity, speed, and memory savings.
┌───────────────┐       ┌───────────────┐
│ Category List │◄──────│ Codes Array   │
│ ['A', 'B', 'C']│       │ [0, 1, 0, 2]  │
└───────────────┘       └───────────────┘
         ▲                      │
         │                      ▼
     Displayed as:       ['A', 'B', 'A', 'C']

Missing values coded as -1, not in category list.
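The two-part layout can be imitated in plain Python. This is an illustrative sketch, not pandas' actual implementation: `encode` and `decode` are hypothetical helpers, and sorting the labels mirrors the default ordering seen in the earlier examples.

```python
def encode(values):
    """Hypothetical helper: split values into (labels, codes)."""
    # Unique non-missing values, sorted, become the label list.
    labels = sorted({v for v in values if v is not None})
    position = {label: i for i, label in enumerate(labels)}
    # Each value is replaced by its label position; missing becomes -1.
    codes = [position[v] if v is not None else -1 for v in values]
    return labels, codes


def decode(labels, codes):
    """Hypothetical helper: rebuild the values from labels and codes."""
    return [labels[c] if c >= 0 else None for c in codes]


labels, codes = encode(['A', 'B', 'A', 'C', None])
print(labels)                 # ['A', 'B', 'C']
print(codes)                  # [0, 1, 0, 2, -1]
print(decode(labels, codes))  # ['A', 'B', 'A', 'C', None]
```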
Myth Busters - 4 Common Misconceptions
Quick: Do you think category codes store the actual text values internally? Commit to yes or no.
Common Belief: Category codes store the actual text values inside the codes array.
Reality: Category codes store only integers that point to unique text labels stored separately.
Why it matters: Thinking codes store text leads to confusion about memory use and performance, causing inefficient data handling.
Quick: Do you think missing values get a normal category code? Commit to yes or no.
Common Belief: Missing values in categorical data get a normal category code like other values.
Reality: Missing values have a special code -1 and are not part of the category labels.
Why it matters: Misunderstanding this can cause errors in filtering or analysis when missing data is treated as a normal category.
Quick: Do you think converting any text column to category always reduces memory? Commit to yes or no.
Common Belief: Converting any text column to category always reduces memory usage.
Reality: If the column has many unique values, categories can use more memory than plain text.
Why it matters: Blindly converting to category can increase memory use and slow down processing.
Quick: Do you think category labels are always sorted alphabetically? Commit to yes or no.
Common Belief: Category labels are always sorted alphabetically.
Reality: Alphabetical order is only the default that .astype('category') applies to text labels; you can define a custom order with pd.CategoricalDtype.
Why it matters: Assuming a fixed order can cause bugs in sorting or comparisons when a custom order is needed.
Expert Zone
1
Category codes use the smallest integer type possible, but operations like adding categories can promote the type, affecting memory.
2
Categorical data comparisons are faster than string comparisons because they compare integer codes, not text.
3
Missing values are coded as -1, which is outside the normal category range, so special care is needed when filtering or replacing.
When NOT to use
Avoid using categories when the column has many unique values close to the number of rows, as this can increase memory and slow down operations. For free text or high-cardinality data, use string types or specialized text processing methods instead.
Production Patterns
In production, categories are used for columns like gender, country codes, or product types to save memory and speed up grouping and filtering. Data pipelines often convert text columns to categories after cleaning. Custom category orders are used for meaningful sorting, like sizes or ratings. Missing data handling with categories is carefully managed to avoid analysis errors.
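A sketch of the pattern described above, using a hypothetical orders table (the column names and values are illustrative only): low-cardinality text columns are converted after cleaning and then used for grouping.

```python
import pandas as pd

# Hypothetical order data; column names are illustrative only.
orders = pd.DataFrame({
    'country': ['DE', 'FR', 'DE', 'US', 'FR'],
    'product_type': ['book', 'toy', 'book', 'book', 'toy'],
    'amount': [12.0, 30.0, 8.0, 15.0, 22.0],
})

# Convert the low-cardinality text columns after cleaning.
for col in ['country', 'product_type']:
    orders[col] = orders[col].astype('category')

# observed=True limits the result to categories present in the data.
totals = orders.groupby('country', observed=True)['amount'].sum()
print(totals.to_dict())  # {'DE': 20.0, 'FR': 52.0, 'US': 15.0}
```

Passing `observed=True` explicitly keeps the result free of rows for categories that are declared but never appear in the data.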
Connections
One-hot encoding
Alternative encoding method for categorical data
Understanding category codes helps grasp one-hot encoding, which represents categories as binary vectors instead of integer codes, useful for machine learning.
Database indexing
Similar concept of using keys to represent data efficiently
Category codes are like database indexes that point to unique values, improving query speed and reducing storage.
Compression algorithms
Both reduce repeated data by referencing unique elements
Category codes work like compression by replacing repeated text with small codes, saving space and speeding access.
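To illustrate the one-hot encoding connection, the sketch below puts the two encodings of the same column side by side. It assumes `pd.get_dummies` as the one-hot step; machine learning libraries offer their own encoders.

```python
import pandas as pd

answers = pd.Series(['Yes', 'No', 'Maybe', 'Yes']).astype('category')

# Integer codes: one column, one small integer per row.
print(answers.cat.codes.tolist())  # [2, 1, 0, 2]  (Maybe=0, No=1, Yes=2)

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(answers, dtype=int)
print(list(one_hot.columns))    # ['Maybe', 'No', 'Yes']
print(one_hot.values.tolist())  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Integer codes are compact but imply an order between labels; one-hot columns avoid that at the cost of one column per category.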
Common Pitfalls
#1 Treating category codes as the actual data values
Wrong approach:
df['col'] = df['col'].cat.codes  # overwrites the labels with bare integer codes
print(df['col'])
Correct approach:
print(df['col'])            # use the categorical column directly to see labels
print(df['col'].cat.codes)  # use codes only for internal operations
Root cause: Misunderstanding that codes are internal pointers, not the real data.
#2 Assuming missing values have a category code
Wrong approach:
df['col'].cat.codes.replace(-1, 0)  # maps missing values onto the first category
Correct approach:
# Register the label as a category, then fill; filling a categorical column with an unregistered label raises an error
df['col'] = df['col'].cat.add_categories('missing_label').fillna('missing_label')
Root cause: Not knowing missing values have a special code -1 outside the category range.
#3 Converting high-cardinality text columns to category blindly
Wrong approach:
df['unique_text'] = df['unique_text'].astype('category')  # column with mostly unique values
Correct approach:
Keep as string or use other encoding methods for high-cardinality data
Root cause: Belief that categories always save memory regardless of data uniqueness.
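One way to avoid pitfall #3 is to check cardinality before converting. The helper below is a hypothetical sketch, and its 0.5 threshold is an illustrative assumption rather than a pandas rule; measuring memory on your own data is more reliable.

```python
import pandas as pd


def worth_categorizing(series, max_unique_ratio=0.5):
    """Hypothetical helper: True when unique values are a small share of rows.

    The 0.5 default is an illustrative assumption, not a pandas rule.
    """
    if len(series) == 0:
        return False
    return series.nunique(dropna=True) / len(series) <= max_unique_ratio


status = pd.Series(['open', 'closed'] * 50)             # 2 unique values in 100 rows
ticket_ids = pd.Series([f'T-{i}' for i in range(100)])  # 100 unique values

print(worth_categorizing(status))      # True
print(worth_categorizing(ticket_ids))  # False
```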
Key Takeaways
Category codes store integers that point to unique category labels, saving memory and speeding up operations.
You can access and customize category labels and codes separately for flexible data handling.
Missing values have a special code -1 and are not part of the category labels.
Categories save memory only when data has many repeated values, not for high-uniqueness columns.
Understanding category internals helps avoid common bugs and optimize performance in data workflows.