0
0
Data Analysis Pythondata~10 mins

Categorical data type optimization in Data Analysis Python - Step-by-Step Execution

Choose your learning style9 modes available
Concept Flow - Categorical data type optimization
Load DataFrame with object columns
Check unique values count
Convert suitable columns to 'category'
Compare memory usage before and after
Use optimized DataFrame for analysis
This flow shows how to convert object columns with repeated values into categorical type to save memory and speed up analysis.
Execution Sample
Data Analysis Python
import pandas as pd

# Create sample data
s = pd.Series(['apple', 'banana', 'apple', 'orange', 'banana', 'banana'])

# Convert to category
d = s.astype('category')

# Show memory usage
print(s.memory_usage(), d.memory_usage())
This code creates a series of fruit names, converts it to a categorical type, and compares memory usage.
Execution Table
StepActionInput/VariableResult/Output
1Create Series s['apple', 'banana', 'apple', 'orange', 'banana', 'banana']s: object dtype, 6 elements
2Check s.memory_usage()sMemory usage: 176 bytes
3Convert s to categorys.astype('category')d: categorical dtype, 6 elements
4Check d.memory_usage()dMemory usage: 136 bytes
5Compare memory usages.memory_usage() vs d.memory_usage()Categorical uses less memory
6Use d for analysisdOptimized data for faster operations
💡 Conversion done, memory usage reduced, ready for optimized analysis
Variable Tracker
VariableStartAfter ConversionFinal
sSeries of strings (object)Same (unchanged)Series of strings (object)
dNot definedSeries of categorical typeSeries of categorical type
Key Moments - 3 Insights
Why does converting to 'category' reduce memory usage?
Because categorical stores unique values once and uses integer codes for data, as shown in execution_table step 4 where memory usage drops.
Can all object columns be converted to categorical?
No, only columns with repeated values and limited unique entries benefit; otherwise, it may not save memory or speed.
Does converting to categorical change the data values?
No, the visible data stays the same; only internal storage changes, as seen in variable_tracker where values remain consistent.
Visual Quiz - 3 Questions
Test your understanding
Look at the execution_table, what is the memory usage of the original Series s at step 2?
A136 bytes
B176 bytes
C6 bytes
D200 bytes
💡 Hint
Check the 'Result/Output' column at step 2 in the execution_table.
At which step does the Series get converted to categorical type?
AStep 1
BStep 5
CStep 3
DStep 6
💡 Hint
Look for the action mentioning 'Convert s to category' in the execution_table.
If the Series had all unique values, what would likely happen to memory usage after conversion?
AMemory usage would stay about the same or increase
BMemory usage would become zero
CMemory usage would decrease significantly
DConversion would fail
💡 Hint
Recall key_moments about when categorical conversion is beneficial.
Concept Snapshot
Categorical data type optimization:
- Use pd.Series.astype('category') to convert object columns
- Saves memory by storing unique values once
- Speeds up operations on repeated values
- Best for columns with few unique entries
- Check memory usage before and after conversion
Full Transcript
We start with a Series of strings representing fruits. Initially, it uses more memory because each string is stored fully. By converting this Series to a categorical type, pandas stores each unique fruit once and uses integer codes for the data. This reduces memory usage, as shown by comparing memory usage before and after conversion. Not all columns benefit; only those with repeated values and limited unique entries. The data values remain the same visually, only the internal storage changes. This optimization helps speed up data analysis and reduces memory load.