Overview - Memory savings with categoricals

What is it?

Memory savings with categoricals is a technique in pandas to reduce the amount of memory used by data columns that have repeated values. Instead of storing the full value for each row, pandas stores a smaller code that points to a list of unique values. This is especially useful for columns with many repeated strings or categories. It helps make data analysis faster and more efficient on large datasets.

Why it matters

When every repeated value is stored in full, large datasets can use a lot of memory, slowing down your computer and limiting the size of data you can work with. Using categoricals reduces memory use, letting you handle bigger datasets and speed up some operations. This means you can analyze more data on your laptop or server without running out of memory.

Where it fits

Before learning this, you should understand basic pandas data structures like DataFrames and Series. After this, you can learn about performance optimization in pandas, such as using efficient data types and vectorized operations. This topic fits into the broader journey of making data analysis scalable and efficient.

Mental Model

Core Idea

Categoricals store repeated values once and use small codes to represent them, saving memory by avoiding repeated storage.

Think of it like...

Imagine a classroom where many students have the same favorite color. Instead of writing the color name next to each student, you give each color a number and write only the number for each student. This way, you write fewer words overall.

┌───────────────┐       ┌───────────────┐
│ Original Data │──────▶│ Unique Values │
│ (many repeats)│       │ (stored once) │
└───────────────┘       └───────────────┘
          │                      ▲
          ▼                      │
┌──────────────────────────────┐
│ Codes (small integers) stored │
│ for each row instead of full  │
│ repeated values               │
└──────────────────────────────┘

Build-Up - 7 Steps

1

Foundation: Understanding pandas DataFrames and memory

Concept: Learn what pandas DataFrames are and how they store data in memory.

A pandas DataFrame is like a table with rows and columns. Each column has a data type, such as numbers or text. Text (string) columns can use a lot of memory if many rows repeat the same words. You can check memory use with df.memory_usage(deep=True); for object-dtype string columns, deep=True counts the string contents themselves, not just the references to them.

Result

You can see how much memory each column uses, noticing that string columns often use more memory.

Knowing how pandas stores data helps you see why repeated strings waste memory and why optimizing this matters.
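As a rough illustration of the memory check described in this step, here is a minimal sketch. The city and visits columns are made-up sample data, and dtype=object is set explicitly so the string column is stored as Python objects regardless of the pandas version in use.

```python
import pandas as pd

# Hypothetical sample data: 10,000 rows drawn from a few repeated city names.
cities = pd.Series(["Amsterdam", "Berlin", "Copenhagen", "Berlin"] * 2_500, dtype=object)
df = pd.DataFrame({"city": cities, "visits": range(10_000)})

# deep=True measures the string objects themselves; deep=False only counts
# the fixed-size references held in the column.
usage = df.memory_usage(deep=True)
print(usage)

city_bytes = usage["city"]
visits_bytes = usage["visits"]
print(f"city uses {city_bytes / visits_bytes:.1f}x the memory of visits")
```

The exact byte counts depend on the pandas and Python versions, so treat the ratio as indicative rather than a fixed number.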

2

Foundation: What are categorical data types?

3

Intermediate: Converting columns to categorical type

4

Intermediate: When categoricals save the most memory

5

Intermediate: Categoricals and performance trade-offs

6

Advanced: Customizing categorical categories and order

7

Expert: Memory savings internals and pitfalls

Under the Hood

Pandas stores categorical columns using two parts: a list of unique categories and an integer array of codes. Each code points to a category, and a code of -1 marks a missing value. This replaces storing full repeated values for each row. The codes use less memory because pandas picks the smallest integer type that can index the categories (int8 when there are only a handful), so each row costs a byte or two instead of a full string. When you access the data, pandas maps codes back to categories on the fly.
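The two parts can be inspected through the .cat accessor. The following is a small sketch on made-up color data; it assumes the categories are inferred from the values, in which case pandas sorts them.

```python
import pandas as pd

s = pd.Series(["red", "blue", "red", "green", "blue", "red"], dtype="category")

# The unique values, stored once (sorted because they were inferred).
print(list(s.cat.categories))   # ['blue', 'green', 'red']

# One small integer per row, pointing into the category list.
print(s.cat.codes.tolist())     # [2, 0, 2, 1, 0, 2]
print(s.cat.codes.dtype)        # int8

# Mapping codes back to categories reproduces the original values.
rebuilt = s.cat.categories.take(s.cat.codes.to_numpy())
print(list(rebuilt))            # ['red', 'blue', 'red', 'green', 'blue', 'red']
```

This sample has no missing values; a missing value would appear as code -1, which the take call above does not handle.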

Why designed this way?

This design balances memory savings and usability. Storing unique categories once avoids repetition, while integer codes allow fast indexing and operations. Storing every value as a full string wastes memory, and heavier compression methods are slower or more complex to work with. This approach is simple, efficient, and integrates well with pandas.

┌───────────────┐       ┌───────────────┐
│ Category List │◀──────│ Integer Codes │
│ (unique vals) │       │ (one per row) │
└───────────────┘       └───────────────┘
          ▲                      │
          │                      ▼
    ┌─────────────────────────────────┐
    │  DataFrame column stores codes   │
    │  and references category list    │
    └─────────────────────────────────┘

Myth Busters - 4 Common Misconceptions

Quick: Do categoricals always reduce memory regardless of data uniqueness? Commit yes or no.

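Common Belief: Categoricals always save memory no matter what data they hold.

Reality: Savings depend on repetition. On a column of mostly unique values, the category list is about as large as the original data and the codes add overhead, so memory use can increase.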

Quick: Do categoricals change the visible data values? Commit yes or no.

Common Belief: Converting to categorical changes the actual data values you see.

Reality: Conversion changes how the column is stored, not the values you see.

Quick: Do categoricals always speed up all data operations? Commit yes or no.

Common Belief: Categoricals make every operation faster.

Reality: Some operations, such as grouping, can get faster, but string manipulations may slow down.

Quick: Can you compare categorical values logically without setting order? Commit yes or no.

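Common Belief: Categorical values can be compared logically by default.

Reality: Unordered categoricals support only equality checks. Comparisons with < or > require a categorical created with ordered=True.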

Expert Zone

1

The memory savings depend not only on the number of unique categories but also on the size of each category string stored once.

2

Ordered categoricals enable meaningful comparisons and sorting, which is crucial for categorical features in machine learning pipelines.

3

Converting large categorical columns back to strings repeatedly can negate memory savings and slow down processing.

When NOT to use

Avoid categoricals when columns have mostly unique values or when frequent string operations are needed. Use string types or specialized compression libraries instead.

Production Patterns

In production, categoricals are used to reduce memory for heavily repeated columns in large datasets, such as product categories, survey responses, or user IDs that recur across many rows. They are combined with chunked processing and saved in efficient formats like Parquet to optimize storage and speed.

Connections

Data Compression

Categoricals are a form of data compression by replacing repeated values with codes.

Understanding categoricals as compression helps connect data science with storage optimization techniques in computer science.

Database Indexing

Categoricals resemble database indexes that map values to keys for faster lookup.

Knowing this analogy helps understand how categoricals speed up grouping and filtering operations.

Human Language Encoding

Categoricals are like how humans use abbreviations or codes to represent common phrases to save effort.

This cross-domain link shows how efficient representation is a universal principle in communication and computing.

Common Pitfalls

#1 Converting a column with many unique strings to categorical expecting memory savings.

Wrong approach: df['col'] = df['col'].astype('category')  # on a column with mostly unique values

Correct approach:
# Check the unique-value ratio before converting
if df['col'].nunique() / len(df) < 0.5:
    df['col'] = df['col'].astype('category')

Root cause:Not checking the uniqueness ratio leads to converting columns where categoricals increase memory.
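The ratio check can be wrapped in a small helper and compared against a forced conversion. This is a sketch only: maybe_categorize is a hypothetical name, the 0.5 threshold is the rule of thumb from the pitfall rather than a pandas default, and both sample columns are made up.

```python
import pandas as pd


def maybe_categorize(series: pd.Series, max_unique_ratio: float = 0.5) -> pd.Series:
    """Convert to category only when values repeat often enough.

    Hypothetical helper; max_unique_ratio is an assumed rule of thumb.
    """
    if len(series) == 0:
        return series
    ratio = series.nunique() / len(series)
    return series.astype("category") if ratio < max_unique_ratio else series


n = 10_000
repeated = pd.Series(["small", "medium", "large", "medium"] * (n // 4))
unique = pd.Series([f"id-{i}" for i in range(n)])

repeated_cat = maybe_categorize(repeated)   # converted: only 3 distinct values
unique_kept = maybe_categorize(unique)      # left alone: every value is distinct

# Forcing the conversion anyway shows the overhead of codes plus categories.
forced = unique.astype("category")

print("repeated:", repeated.memory_usage(deep=True), "->", repeated_cat.memory_usage(deep=True))
print("unique:  ", unique.memory_usage(deep=True), "->", forced.memory_usage(deep=True))
```

A fixed threshold is a simplification. Comparing memory_usage(deep=True) before and after on a sample of the real data is a more direct check.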

#2 Assuming categorical columns can be compared with < or > without setting order.

Wrong approach: df['cat_col'] > 'medium'  # when cat_col is unordered categorical

Correct approach:
df['cat_col'] = pd.Categorical(df['cat_col'], categories=['small','medium','large'], ordered=True)
df['cat_col'] > 'medium'

Root cause: Without ordered=True, pandas has no ranking for the categories, so < and > comparisons raise a TypeError; only equality checks are allowed.
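The difference between unordered and ordered categoricals can be seen in a short sketch. The sizes data and the small < medium < large ranking are assumptions chosen to match the example in this pitfall.

```python
import pandas as pd

sizes = pd.Series(["small", "large", "medium", "small"])

# Unordered: only == and != are supported, so > raises TypeError.
unordered = sizes.astype("category")
try:
    unordered > "medium"
    unordered_error = None
except TypeError as exc:
    unordered_error = exc
print(type(unordered_error).__name__)    # TypeError

# Ordered: the position in `categories` defines the ranking.
size_type = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
ordered = sizes.astype(size_type)

mask = ordered > "medium"
print(mask.tolist())                     # [False, True, False, False]
print(ordered.sort_values().tolist())    # ['small', 'small', 'medium', 'large']
print(ordered.max())                     # large
```

Sorting and max follow the declared category order, not alphabetical order, which is why 'large' ranks above 'medium' here.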

#3 Converting categorical columns back to strings repeatedly during processing.

Wrong approach:
for chunk in data_chunks:
    chunk['cat_col'] = chunk['cat_col'].astype(str)
    # process strings

Correct approach:
for chunk in data_chunks:
    # process categorical data directly, or convert once if needed
    counts = chunk['cat_col'].value_counts()  # illustrative: works on the categorical as-is

Root cause:Repeated conversion wastes memory and CPU, negating benefits of categoricals.
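When a string transformation is needed, one option is to apply it to the categories rather than to every row. The sketch below assumes the transformation maps each category to a distinct new value (rename_categories requires unique results); the colors data is made up.

```python
import pandas as pd

colors = pd.Series(["red", "blue", "red", "green"] * 1_000, dtype="category")

# Per-row path from the pitfall: cast all 4,000 rows to strings, then transform.
upper_strings = colors.astype(str).str.upper()

# Category-level path: transform only the 3 categories; the codes are reused
# and the result is still categorical.
upper_cat = colors.cat.rename_categories(str.upper)

print(list(upper_cat.cat.categories))   # ['BLUE', 'GREEN', 'RED']
print(upper_cat.dtype)                  # category
```

Both paths yield the same visible values. The second keeps the column categorical, so the memory savings carry through the rest of the pipeline.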

Key Takeaways

Categorical data types save memory by storing unique values once and using small integer codes for each row.

Memory savings are greatest when columns have many repeated values and few unique categories.

Converting to categorical changes storage but not the visible data, so it is safe to use for analysis.

Categoricals can speed up some operations like grouping but may slow down string manipulations.

Understanding when and how to use categoricals prevents common mistakes and optimizes data processing.