
How to Speed Up pandas: Tips for Faster Data Processing

To speed up pandas, use vectorized operations instead of loops, avoid applying functions row-wise, and leverage libraries like Numba or Modin for faster computation. Also, reduce memory usage by selecting appropriate data types and filtering data early.
📐

Syntax

Here are common ways to speed up pandas operations:

  • Vectorized operations: Use built-in pandas functions that operate on whole columns.
  • Data type optimization: Convert columns to smaller types like category or int8.
  • Using faster libraries: Replace pandas with Modin or accelerate functions with Numba.
python
import pandas as pd
import numpy as np

# Vectorized operation example
s = pd.Series(np.random.randint(0, 100, size=1000000))
s_squared = s ** 2

# Data type optimization example
df = pd.DataFrame({'A': ['a', 'b', 'a', 'c'] * 250000})
df['A'] = df['A'].astype('category')
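Conditional logic can also be vectorized instead of applied row by row. The sketch below (an illustrative addition, not from the original article) compares a row-wise `apply()` with `np.where`, which evaluates the whole column at once:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'num': np.random.randint(0, 100, size=1_000_000)})

# Row-wise version (slow): calls a Python function once per row
labels_slow = df['num'].apply(lambda x: 'high' if x >= 50 else 'low')

# Vectorized version (fast): np.where evaluates the whole column at once
labels_fast = np.where(df['num'] >= 50, 'high', 'low')
```

Both produce the same labels, but the vectorized version runs in compiled NumPy code rather than the Python interpreter.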
💻

Example

This example shows how vectorized operations are faster than loops and how changing data types saves memory.

python
import pandas as pd
import numpy as np
import time

# Create large DataFrame
size = 1000000
df = pd.DataFrame({'num': np.random.randint(0, 100, size=size)})

# Slow: using loop
start = time.time()
result_loop = []
for x in df['num']:
    result_loop.append(x ** 2)
end = time.time()
loop_time = end - start

# Fast: vectorized operation
start = time.time()
result_vec = df['num'] ** 2
end = time.time()
vec_time = end - start

# Data type optimization
df['num_cat'] = df['num'].astype('category')
memory_before = df['num'].memory_usage(deep=True)
memory_after = df['num_cat'].memory_usage(deep=True)

print(f"Loop time: {loop_time:.4f} seconds")
print(f"Vectorized time: {vec_time:.4f} seconds")
print(f"Memory before: {memory_before} bytes")
print(f"Memory after: {memory_after} bytes")
Output
Loop time: 1.2000 seconds
Vectorized time: 0.0200 seconds
Memory before: 8000000 bytes
Memory after: 1000000 bytes
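Besides `category`, numeric columns can often be shrunk with `pd.to_numeric` and its `downcast` option. A minimal sketch (assuming the values fit in a small integer type, as they do here):

```python
import pandas as pd
import numpy as np

# Default 64-bit integers: 8 bytes per value
s = pd.Series(np.random.randint(0, 100, size=1_000_000), dtype='int64')

# Downcast to the smallest integer type that can hold the data (int8 for 0-99)
s_small = pd.to_numeric(s, downcast='integer')

print(s.memory_usage(deep=True))
print(s_small.memory_usage(deep=True))
```

The downcast column stores one byte per value instead of eight, so memory drops to roughly an eighth of the original.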
⚠️

Common Pitfalls

Many beginners slow pandas down by looping over rows or calling apply() with plain Python functions instead of using vectorized methods. Keeping the default data types also wastes memory and slows processing.

Try to avoid:

  • Looping over rows with for loops.
  • Using apply() with slow Python functions.
  • Leaving repeated string values as object instead of converting them to category.
python
import pandas as pd
import numpy as np

# Slow approach

df = pd.DataFrame({'num': np.arange(5)})
def slow_square(x):
    return x ** 2

# Using apply (slow)
df['square_slow'] = df['num'].apply(slow_square)

# Fast approach
# Vectorized operation
df['square_fast'] = df['num'] ** 2

print(df)
Output
   num  square_slow  square_fast
0    0            0            0
1    1            1            1
2    2            4            4
3    3            9            9
4    4           16           16
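The third pitfall, leaving repeated strings as object, is easy to quantify. This small sketch (an illustrative example with made-up city names) measures the memory saved by converting to category:

```python
import pandas as pd

# Repeated string values stored as plain Python objects
cities = pd.Series(['London', 'Paris', 'Tokyo'] * 100_000)
as_object = cities.memory_usage(deep=True)

# The same values stored as a category: each unique string kept once,
# rows hold only small integer codes
as_category = cities.astype('category').memory_usage(deep=True)

print(f"object:   {as_object} bytes")
print(f"category: {as_category} bytes")
```

With only three unique values across 300,000 rows, the category version needs a fraction of the memory.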
📊

Quick Reference

Summary tips to speed up pandas:

  • Use vectorized operations instead of loops.
  • Convert columns to smaller data types like category or int8.
  • Filter data early to reduce size.
  • Use libraries like Modin for parallel processing.
  • Use Numba to speed up custom functions.
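The "filter data early" tip applies from the moment data is loaded. A sketch of the idea, using an in-memory CSV via io.StringIO as a stand-in for a real file (the column names here are hypothetical):

```python
import io
import pandas as pd

# Hypothetical CSV data standing in for a file on disk
csv_data = io.StringIO(
    "id,value,group\n"
    "1,10.5,a\n"
    "2,3.2,b\n"
    "3,7.7,a\n"
)

# Load only the needed columns and pick compact dtypes up front
df = pd.read_csv(
    csv_data,
    usecols=['value', 'group'],
    dtype={'value': 'float32', 'group': 'category'},
)

# Drop unneeded rows immediately, before any further processing
df = df[df['group'] == 'a']
print(df)
```

Selecting columns with `usecols` and declaring dtypes at read time means the unwanted data is never materialized, which pays off on large files.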

Key Takeaways

  • Use vectorized pandas operations instead of loops for faster processing.
  • Optimize data types to reduce memory and speed up computations.
  • Avoid slow row-wise apply functions; prefer built-in pandas methods.
  • Leverage libraries like Modin or Numba for heavy or custom computations.
  • Filter and reduce data size early to improve performance.