How to Reduce Memory Usage in pandas Efficiently
To reduce memory usage in pandas, convert columns to more efficient data types, like category for repeated text and smaller numeric types like int8 or float32. Also, load data selectively and drop unused columns early to save memory.

Syntax

Here are common ways to reduce memory usage in pandas:

- astype(): Change column data types to smaller types.
- category: Use for columns with repeated text values.
- read_csv() with usecols: Load only needed columns.
- drop(): Remove unused columns early.
```python
df['column'] = df['column'].astype('category')
df['num'] = df['num'].astype('int8')
df = pd.read_csv('file.csv', usecols=['col1', 'col2'])
df = df.drop(columns=['unneeded_col'])
```
Example
This example shows how to load a CSV, convert a text column to category, and convert a numeric column to int8 to save memory.
```python
import pandas as pd

# Create example DataFrame
data = {
    'id': range(1, 6),
    'color': ['red', 'blue', 'red', 'green', 'blue'],
    'value': [100, 200, 150, 300, 250]
}
df = pd.DataFrame(data)

# Check initial memory usage
initial_mem = df.memory_usage(deep=True).sum()

# Convert 'color' to category
df['color'] = df['color'].astype('category')

# Convert 'value' to smaller int type
df['value'] = df['value'].astype('int8')

# Check reduced memory usage
reduced_mem = df.memory_usage(deep=True).sum()

print(f"Initial memory usage: {initial_mem} bytes")
print(f"Reduced memory usage: {reduced_mem} bytes")
print(df.dtypes)
```
Output
Initial memory usage: 376 bytes
Reduced memory usage: 196 bytes
id int64
color category
value int8
dtype: object
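Selective loading with usecols can be shown the same way. This sketch uses an in-memory buffer as a stand-in for a CSV file on disk, reads back only two of its three columns, and assigns a compact dtype at read time (both usecols and dtype are standard read_csv parameters):

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file on disk
csv_data = io.StringIO("col1,col2,col3\n1,a,x\n2,b,y\n3,c,z\n")

# Load only the needed columns; col3 is never read into memory.
# dtype lets you downcast at load time instead of afterwards.
df = pd.read_csv(csv_data, usecols=['col1', 'col2'], dtype={'col1': 'int8'})
print(df.columns.tolist())
print(df.dtypes)
```

Passing dtype at read time avoids ever materializing the column at its default int64 width.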
Common Pitfalls
Common mistakes when reducing memory in pandas include:

- Converting numeric columns to types too small to hold their values, causing silent integer overflow and incorrect results.
- Using category for columns with many unique values, which can increase memory.
- Not checking memory usage before and after conversions.

Always verify data ranges and unique counts before changing types.
```python
import pandas as pd

# Wrong: converting int64 to int8 without checking the value range.
# int8 holds -128 to 127, so 1000 and 300 silently wrap around
# (no exception is raised) and produce incorrect values.
s = pd.Series([0, 1000, 300])
s_wrong = s.astype('int8')
print(s_wrong)

# Right: convert to int16, which can hold values up to 32767
s_right = s.astype('int16')
print(s_right)
```
Output
0      0
1    -24
2     44
dtype: int8
0       0
1    1000
2     300
dtype: int16
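One way to apply the advice above, checking ranges and unique counts before converting, is to let pandas pick a safe integer size with pd.to_numeric(downcast=...) and to compare nunique() against the column length before using category. This is a sketch, and the 50% uniqueness threshold is a judgment call, not a pandas rule:

```python
import pandas as pd

s = pd.Series([0, 1000, 300])

# Let pandas choose the smallest safe integer type.
# 1000 does not fit in int8, so this yields int16.
s_small = pd.to_numeric(s, downcast='integer')
print(s_small.dtype)

# Only use category when unique values are few relative to length
colors = pd.Series(['red', 'blue', 'red', 'red', 'blue', 'red'])
if colors.nunique() / len(colors) < 0.5:  # threshold is a judgment call
    colors = colors.astype('category')
print(colors.dtype)
```

Unlike a manual astype('int8'), downcast='integer' can never overflow, because pandas checks the actual value range first.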
Quick Reference
Summary tips to reduce pandas memory usage:

- Use astype('category') for repeated text columns.
- Downcast numeric columns with astype() to smaller types like int8, int16, or float32.
- Load only needed columns with usecols in read_csv().
- Drop unused columns early with drop().
- Check memory usage with memory_usage(deep=True) before and after changes.
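The last tip can be made concrete: memory_usage(deep=True) returns a per-column Series of byte counts, so you can see exactly where a conversion pays off. A minimal sketch on synthetic data:

```python
import pandas as pd

# Synthetic data: repeated strings and small ints, 5000 rows
df = pd.DataFrame({
    'color': ['red', 'blue', 'red', 'green', 'blue'] * 1000,
    'value': [100, 200, 150, 300, 250] * 1000,
})

before = df.memory_usage(deep=True)  # bytes per column, index included

df['color'] = df['color'].astype('category')
df['value'] = df['value'].astype('int16')

after = df.memory_usage(deep=True)

# Side-by-side comparison of per-column byte counts
print(pd.DataFrame({'before': before, 'after': after}))
```

Comparing per column, rather than only the sum, shows which conversions matter and which columns are still worth attention.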
Key Takeaways
Convert text columns to 'category' type to save memory on repeated strings.
Downcast numeric columns to smaller integer or float types when possible.
Load only necessary columns from files to reduce initial memory load.
Drop columns you don't need as early as possible in your workflow.
Always check memory usage before and after optimizations to confirm savings.