How to Speed Up pandas: Tips for Faster Data Processing
To speed up pandas, use vectorized operations instead of loops, avoid applying functions row-wise, and leverage libraries like Numba or Modin for faster computation. Also, reduce memory usage by selecting appropriate data types and filtering data early.

Syntax
Here are common ways to speed up pandas operations:
- Vectorized operations: Use built-in pandas functions that operate on whole columns.
- Data type optimization: Convert columns to smaller types like `category` or `int8`.
- Faster libraries: Replace pandas with Modin, or accelerate custom functions with Numba.
```python
import pandas as pd
import numpy as np

# Vectorized operation example
s = pd.Series(np.random.randint(0, 100, size=1000000))
s_squared = s ** 2

# Data type optimization example
df = pd.DataFrame({'A': ['a', 'b', 'a', 'c'] * 250000})
df['A'] = df['A'].astype('category')
```
Example
This example shows how vectorized operations are faster than loops and how changing data types saves memory.
```python
import pandas as pd
import numpy as np
import time

# Create a large DataFrame
size = 1000000
df = pd.DataFrame({'num': np.random.randint(0, 100, size=size)})

# Slow: using a loop
start = time.time()
result_loop = []
for x in df['num']:
    result_loop.append(x ** 2)
end = time.time()
loop_time = end - start

# Fast: vectorized operation
start = time.time()
result_vec = df['num'] ** 2
end = time.time()
vec_time = end - start

# Data type optimization
df['num_cat'] = df['num'].astype('category')
memory_before = df['num'].memory_usage(deep=True)
memory_after = df['num_cat'].memory_usage(deep=True)

print(f"Loop time: {loop_time:.4f} seconds")
print(f"Vectorized time: {vec_time:.4f} seconds")
print(f"Memory before: {memory_before} bytes")
print(f"Memory after: {memory_after} bytes")
```
Output (exact times and byte counts vary by machine):
Loop time: 1.2000 seconds
Vectorized time: 0.0200 seconds
Memory before: 8000000 bytes
Memory after: 1000000 bytes
Common Pitfalls
Many beginners slow pandas down by looping over rows or calling `apply()` with plain Python functions instead of using vectorized methods. Keeping default data types also wastes memory and slows processing.
Try to avoid:
- Looping over rows with `for` loops.
- Using `apply()` with slow Python functions.
- Not converting repeated strings to `category`.
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'num': np.arange(5)})

# Slow approach: element-wise apply with a Python function
def slow_square(x):
    return x ** 2

df['square_slow'] = df['num'].apply(slow_square)

# Fast approach: vectorized operation
df['square_fast'] = df['num'] ** 2

print(df)
```
Output
num square_slow square_fast
0 0 0 0
1 1 1 1
2 2 4 4
3 3 9 9
4 4 16 16
Quick Reference
Summary tips to speed up pandas:
- Use vectorized operations instead of loops.
- Convert columns to smaller data types like `category` or `int8`.
- Filter data early to reduce size.
- Use libraries like Modin for parallel processing.
- Use Numba to speed up custom functions.
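The downcasting and early-filtering tips can be combined in one pass. The sketch below is illustrative (the column name and threshold are made up); `pd.to_numeric` with `downcast='integer'` picks the smallest integer type that fits the data:

```python
import pandas as pd
import numpy as np

# Sample frame: 'score' holds values 0-99, stored as int64 by default
df = pd.DataFrame({'score': np.random.randint(0, 100, size=1_000_000)})

# Filter early: drop unneeded rows before any further processing
df = df[df['score'] >= 50].reset_index(drop=True)

# Downcast: values 50-99 fit in int8, cutting per-value storage 8x
df['score'] = pd.to_numeric(df['score'], downcast='integer')

print(df['score'].dtype)  # int8
```

Filtering before downcasting also means the conversion itself touches fewer rows.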
Key Takeaways
- Use vectorized pandas operations instead of loops for faster processing.
- Optimize data types to reduce memory and speed up computations.
- Avoid slow row-wise apply functions; prefer built-in pandas methods.
- Leverage libraries like Modin or Numba for heavy or custom computations.
- Filter and reduce data size early to improve performance.
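As a concrete illustration of the Numba tip, here is a minimal sketch of compiling a loop-heavy custom function. It assumes the optional `numba` package; the `rolling_max_gain` function and sample data are illustrative, and the code falls back to plain Python if Numba is not installed:

```python
import numpy as np

try:
    from numba import njit  # optional dependency
except ImportError:
    def njit(func):  # fallback: no-op decorator when Numba is absent
        return func

@njit
def rolling_max_gain(prices):
    # Loop-heavy logic that has no direct pandas vectorization:
    # largest gain from any earlier price to any later price.
    best_gain = 0.0
    min_price = prices[0]
    for p in prices[1:]:
        if p < min_price:
            min_price = p
        elif p - min_price > best_gain:
            best_gain = p - min_price
    return best_gain

prices = np.array([5.0, 3.0, 8.0, 4.0, 10.0])
print(rolling_max_gain(prices))  # 7.0 (buy at 3, sell at 10)
```

With Numba present, the decorated loop is compiled to machine code on first call, which on large arrays is typically far faster than the equivalent pure-Python loop or `apply()`.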