Pandas · How-To · Beginner · 3 min read

How to Read Large CSV Files Efficiently in pandas

To read large CSV files in pandas, use the pd.read_csv() function with the chunksize parameter to load the file in smaller parts. This approach helps manage memory by processing the file piece by piece instead of loading it all at once.

Syntax

The main function to read CSV files in pandas is pd.read_csv(). To handle large files, you can use the chunksize parameter, which reads the file in smaller pieces called chunks.

  • filepath_or_buffer: Path to the CSV file.
  • chunksize: Number of rows per chunk to read at a time.
  • usecols: Select specific columns to reduce memory usage.
  • dtype: Specify data types to optimize memory.
```python
pd.read_csv(filepath_or_buffer, chunksize=10000, usecols=None, dtype=None)
```
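As a minimal sketch of how the syntax behaves: when `chunksize` is set, `pd.read_csv()` returns an iterator of DataFrames instead of a single DataFrame. The in-memory CSV below is a placeholder standing in for a real large file.

```python
import io
import pandas as pd

# A small in-memory CSV stands in for a real large file on disk.
csv_data = io.StringIO("a,b\n1,x\n2,y\n3,z\n")

# With chunksize set, read_csv yields DataFrames of up to 2 rows each.
for chunk in pd.read_csv(csv_data, chunksize=2):
    print(len(chunk))  # each chunk is a regular DataFrame
```

Each chunk supports the full DataFrame API, so anything you would do to the whole file can be done piecewise.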

Example

This example shows how to read a large CSV file in chunks of 5000 rows, process each chunk by counting rows, and combine results.

```python
import pandas as pd

chunk_size = 5000
row_count = 0

for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    row_count += len(chunk)

print(f'Total rows read: {row_count}')
```

Output (for a 25,000-row file):

```
Total rows read: 25000
```
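Counting rows is the simplest case; a more common pattern is to filter each chunk and stitch the surviving rows back together with `pd.concat()`. This is a sketch using a small in-memory CSV in place of a real large file; the column name `value` is made up for illustration.

```python
import io
import pandas as pd

csv_data = io.StringIO("value\n1\n5\n10\n2\n8\n")

# Keep only rows where value > 4, one chunk at a time,
# then combine the filtered pieces into one DataFrame.
filtered = [chunk[chunk['value'] > 4]
            for chunk in pd.read_csv(csv_data, chunksize=2)]
result = pd.concat(filtered, ignore_index=True)
print(result['value'].tolist())  # [5, 10, 8]
```

Because each chunk is filtered before being kept, peak memory stays proportional to the chunk size plus the (hopefully much smaller) filtered result.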

Common Pitfalls

Common mistakes when reading large CSV files include:

  • Trying to load the entire file at once, causing memory errors.
  • Not specifying chunksize when the file is too large.
  • Ignoring data types, which can increase memory usage.

Always use chunksize for large files and consider specifying dtype and usecols to reduce memory.

```python
import pandas as pd

# Wrong: loading the entire large file at once (may exhaust memory)
# df = pd.read_csv('large_file.csv')

# Right: read in chunks
chunk_size = 10000
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    # process each chunk here
    pass
```
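The `dtype` and `usecols` pitfalls can be addressed at read time. The sketch below uses a small in-memory CSV with made-up column names; the idea is to load only the columns you need and give numeric columns compact types instead of the 64-bit defaults.

```python
import io
import pandas as pd

csv_data = io.StringIO("id,score,comment\n1,3.5,good\n2,4.0,ok\n")

# Load only two of the three columns, with compact dtypes.
df = pd.read_csv(csv_data,
                 usecols=['id', 'score'],
                 dtype={'id': 'int32', 'score': 'float32'})
print(df.dtypes)
```

On wide files with many unused columns, `usecols` alone can cut memory dramatically; `dtype` then halves the footprint of each numeric column it narrows.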

Quick Reference

| Parameter | Description | Example |
| --- | --- | --- |
| `chunksize` | Number of rows per chunk to read | `chunksize=5000` |
| `usecols` | Select specific columns to load | `usecols=['col1', 'col2']` |
| `dtype` | Specify data types for columns | `dtype={'col1': 'int32'}` |
| `iterator` | Return an iterator for manual chunk handling | `iterator=True` |
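For manual control, `iterator=True` returns a `TextFileReader` whose `get_chunk()` method pulls a chosen number of rows on demand, rather than looping over fixed-size chunks. A sketch, again with a small in-memory CSV standing in for a large file:

```python
import io
import pandas as pd

csv_data = io.StringIO("a\n1\n2\n3\n4\n5\n")

# get_chunk(n) reads the next n rows each time it is called.
reader = pd.read_csv(csv_data, iterator=True)
first = reader.get_chunk(2)   # first 2 rows
rest = reader.get_chunk(3)    # next 3 rows
print(len(first), len(rest))  # 2 3
```

This is useful when different parts of the file need different chunk sizes, e.g. a small peek at the header rows before streaming the remainder.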

Key Takeaways

  • Use the chunksize parameter in pd.read_csv() to read large CSV files in smaller parts.
  • Specify data types and select only needed columns to reduce memory usage.
  • Avoid loading the entire large file at once to prevent memory errors.
  • Process each chunk separately to handle data efficiently.
  • Use iterator=True if you want manual control over chunk reading.