
How to Reduce Memory Usage in pandas Efficiently

To reduce memory usage in pandas, convert columns to more compact data types: category for text columns with repeated values, and smaller numeric types such as int8 or float32 where the values fit. Also load only the columns you need and drop unused ones early.

Syntax

Here are common ways to reduce memory usage in pandas:

  • astype(): Change column data types to smaller types.
  • category: Use for columns with repeated text values.
  • read_csv() with usecols: Load only needed columns.
  • drop(): Remove unused columns early.
python
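import pandas as pd

df['column'] = df['column'].astype('category')  # repeated text values
df['num'] = df['num'].astype('int8')  # only safe if every value is in -128..127
df = pd.read_csv('file.csv', usecols=['col1', 'col2'])  # load only needed columns
df = df.drop(columns=['unneeded_col'])  # remove an unused column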
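Selective loading can be combined with type conversion at read time. The sketch below is a minimal illustration, assuming a small in-memory CSV (the `csv_text` contents and the `notes` column are made up for the example). It uses the `dtype` parameter of `read_csv()`, which is not mentioned above, to set compact types while parsing.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "id,color,value,notes\n1,red,100,a\n2,blue,200,b\n3,red,150,c\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["id", "color", "value"],               # the 'notes' column is never loaded
    dtype={"color": "category", "value": "int16"},  # compact dtypes set while parsing
)

print(df.dtypes)
```

Setting types during the read avoids first building the wider default columns and converting them afterwards, which can matter for large files.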

Example

This example builds a small DataFrame, converts a text column to category, and converts a numeric column to a smaller integer type to save memory.

python
import pandas as pd

# Create example DataFrame
data = {
    'id': range(1, 6),
    'color': ['red', 'blue', 'red', 'green', 'blue'],
    'value': [100, 200, 150, 300, 250]
}
df = pd.DataFrame(data)

# Check initial memory usage
initial_mem = df.memory_usage(deep=True).sum()

# Convert 'color' to category
df['color'] = df['color'].astype('category')

# Convert 'value' to a smaller int type. int8 only holds -128..127,
# so values up to 300 need int16.
df['value'] = df['value'].astype('int16')

# Check reduced memory usage
reduced_mem = df.memory_usage(deep=True).sum()

print(f"Initial memory usage: {initial_mem} bytes")
print(f"Reduced memory usage: {reduced_mem} bytes")
print(df.dtypes)
Output (the two byte counts are omitted because they vary with the pandas version and platform; the reduced figure is typically lower than the initial one)
id          int64
color    category
value       int16
dtype: object
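The before-and-after check can also be done per column, which shows where the savings come from. This is a minimal sketch on a made-up frame of 30,000 rows (the data and the `report` name are assumptions for illustration). `memory_usage(deep=True)` returns one entry per column plus one for the index.

```python
import pandas as pd

# Hypothetical larger frame: three colors and three values repeated 10,000 times.
n = 10_000
df = pd.DataFrame({
    "color": ["red", "blue", "green"] * n,
    "value": [100, 200, 300] * n,
})

before = df.memory_usage(deep=True)  # bytes per column, plus the index

df["color"] = df["color"].astype("category")
df["value"] = df["value"].astype("int16")  # 100..300 fits in int16

after = df.memory_usage(deep=True)

# Side-by-side comparison in bytes; exact numbers vary by version and platform.
report = pd.DataFrame({"before": before, "after": after})
print(report)
```

On a frame this size the savings are much easier to see than on a five-row example, where fixed overhead dominates.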

Common Pitfalls

Common mistakes when reducing memory in pandas include:

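  • Converting numeric columns to types too small to hold the values. Integer casts with astype() are not range-checked, so out-of-range values can wrap around silently instead of raising an error.
  • Using category for columns with many unique values, which can increase memory because the list of categories is stored in addition to the per-row codes.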
  • Not checking memory usage before and after conversions.

Always verify data ranges and unique counts before changing types.

python
import pandas as pd

# Wrong: converting int64 to int8 without checking the range.
# No exception is raised here; out-of-range values silently wrap around.
s = pd.Series([0, 1000, 300])
s_wrong = s.astype('int8')
print(s_wrong)

# Right: convert to int16, which can hold values up to 32767
s_right = s.astype('int16')
print(s_right)
Output
0      0
1    -24
2     44
dtype: int8
0       0
1    1000
2     300
dtype: int16
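The advice to verify data ranges and unique counts can be sketched as follows. This is one possible approach rather than a required one: it uses `numpy.iinfo` for the range check, `pd.to_numeric(..., downcast="integer")` to let pandas choose the smallest integer type that fits, and a unique-value ratio as a rough guide for category. The `ids` series and the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 1000, 300])

# Range check: does every value fit in the target type?
info = np.iinfo("int8")
fits_int8 = bool(s.min() >= info.min and s.max() <= info.max)
print(fits_int8)  # False, because 1000 and 300 exceed 127

# Alternative: let pandas pick the smallest integer type that holds the data.
s_small = pd.to_numeric(s, downcast="integer")
print(s_small.dtype)  # int16

# Unique-count check: category helps only when values repeat a lot.
ids = pd.Series([f"user_{i}" for i in range(1000)])
unique_ratio = ids.nunique() / len(ids)
print(unique_ratio)  # 1.0, every value is distinct, so category would not help
```

A ratio close to 1.0 means nearly every value is unique, so the column is a poor candidate for category; a ratio close to 0 means heavy repetition and a likely saving.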

Quick Reference

Summary tips to reduce pandas memory usage:

  • Use astype('category') for repeated text columns.
  • Downcast numeric columns with astype() to smaller types like int8, int16, or float32.
  • Load only needed columns with usecols in read_csv().
  • Drop unused columns early with drop().
  • Check memory usage with memory_usage(deep=True) before and after changes.
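These tips can be combined into a small helper. The function below is a rough sketch, not a standard pandas API: the name `shrink` and the `max_unique_ratio` threshold are assumptions chosen for illustration, and it assumes text columns contain only hashable values such as strings.

```python
import pandas as pd
from pandas.api import types as pdt


def shrink(df: pd.DataFrame, max_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Return a copy of df with smaller dtypes where that looks safe."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if pdt.is_integer_dtype(s):
            # Picks the smallest integer type that holds every value.
            out[col] = pd.to_numeric(s, downcast="integer")
        elif pdt.is_float_dtype(s):
            # float32 keeps roughly 7 significant digits; skip if you need more.
            out[col] = pd.to_numeric(s, downcast="float")
        elif pdt.is_object_dtype(s) or pdt.is_string_dtype(s):
            # Convert to category only when values repeat often enough.
            if len(s) > 0 and s.nunique() / len(s) <= max_unique_ratio:
                out[col] = s.astype("category")
    return out


# Hypothetical data for a quick check.
df = pd.DataFrame({
    "id": range(1000),
    "color": ["red", "blue"] * 500,
    "price": [1.5, 2.25] * 500,
})
small = shrink(df)

print(small.dtypes)
print(df.memory_usage(deep=True).sum(), "->", small.memory_usage(deep=True).sum())
```

Returning a copy leaves the original frame untouched, so the two can be compared before the smaller version replaces it.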

Key Takeaways

  • Convert text columns to 'category' type to save memory on repeated strings.
  • Downcast numeric columns to smaller integer or float types when possible.
  • Load only necessary columns from files to reduce initial memory load.
  • Drop columns you don't need as early as possible in your workflow.
  • Always check memory usage before and after optimizations to confirm savings.