# How to Read Large CSV Files Efficiently in pandas
To read large CSV files in pandas, use the `pd.read_csv()` function with the `chunksize` parameter to load the file in smaller parts. This approach keeps memory usage manageable by processing the file piece by piece instead of loading it all at once.

## Syntax
The main function to read CSV files in pandas is `pd.read_csv()`. To handle large files, you can use the `chunksize` parameter, which reads the file in smaller pieces called chunks.

- `filepath_or_buffer`: Path to the CSV file.
- `chunksize`: Number of rows per chunk to read at a time.
- `usecols`: Select specific columns to reduce memory usage.
- `dtype`: Specify data types to optimize memory.

```python
pd.read_csv(filepath_or_buffer, chunksize=10000, usecols=None, dtype=None)
```
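To see how `usecols` and `dtype` cut memory before any chunking is involved, here is a minimal sketch. The file contents and column names (`col1`, `col2`, `col3`) are hypothetical; an in-memory buffer stands in for a large file on disk.

```python
import io
import pandas as pd

# Small in-memory CSV standing in for a large file on disk (hypothetical data)
csv_data = io.StringIO("col1,col2,col3\n1,a,x\n2,b,y\n3,c,z\n")

# Load only the columns you need, with a compact integer dtype
df = pd.read_csv(csv_data, usecols=['col1', 'col2'], dtype={'col1': 'int32'})

print(df.dtypes['col1'])   # int32 instead of the default int64
print(list(df.columns))    # col3 was never loaded
```

Skipping unused columns at read time is cheaper than dropping them afterwards, because pandas never allocates memory for them.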
## Example
This example shows how to read a large CSV file in chunks of 5000 rows, process each chunk by counting rows, and combine results.
```python
import pandas as pd

chunk_size = 5000
row_count = 0
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    row_count += len(chunk)

print(f'Total rows read: {row_count}')
```
## Output
Total rows read: 25000
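Counting rows is the simplest case; often you want to filter or transform each chunk and combine the surviving rows with `pd.concat`. The sketch below uses a synthetic in-memory CSV (a `value` column with the numbers 0 through 99) in place of a real large file.

```python
import io
import pandas as pd

# Synthetic stand-in for a large CSV file (hypothetical data)
csv_text = "value\n" + "\n".join(str(i) for i in range(100))

# Filter each chunk, then combine the kept rows at the end
parts = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=25):
    parts.append(chunk[chunk['value'] >= 90])

result = pd.concat(parts, ignore_index=True)
print(len(result))  # 10 rows survive the filter
```

Collecting filtered chunks in a list and concatenating once at the end is much faster than appending to a DataFrame inside the loop, and the combined result stays small as long as the filter is selective.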
## Common Pitfalls
Common mistakes when reading large CSV files include:
- Trying to load the entire file at once, causing memory errors.
- Not specifying `chunksize` when the file is too large.
- Ignoring data types, which can increase memory usage.
Always use `chunksize` for large files, and consider specifying `dtype` and `usecols` to reduce memory usage.
```python
import pandas as pd

# Wrong: loading the entire large file at once (may crash)
# df = pd.read_csv('large_file.csv')

# Right: read in chunks
chunk_size = 10000
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    # process each chunk
    pass
```
## Quick Reference
| Parameter | Description | Example |
|---|---|---|
| `chunksize` | Number of rows per chunk to read | `chunksize=5000` |
| `usecols` | Select specific columns to load | `usecols=['col1', 'col2']` |
| `dtype` | Specify data types for columns | `dtype={'col1': 'int32'}` |
| `iterator` | Return an iterator for manual chunk handling | `iterator=True` |
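With `iterator=True`, `pd.read_csv()` returns a `TextFileReader` whose `get_chunk(n)` method pulls rows on demand, so chunk sizes can vary from call to call. A minimal sketch, again using a synthetic in-memory CSV (a single column `a` with ten rows) in place of a real file:

```python
import io
import pandas as pd

# Synthetic ten-row CSV standing in for a large file (hypothetical data)
csv_text = "a\n" + "\n".join(str(i) for i in range(10))

# iterator=True gives manual control instead of a fixed-size loop
reader = pd.read_csv(io.StringIO(csv_text), iterator=True)
first = reader.get_chunk(4)   # pull the first 4 rows
second = reader.get_chunk(4)  # pull the next 4 rows
print(len(first), len(second))  # 4 4
```

This is useful when the amount of data you need per step depends on what you have already seen, e.g. reading just enough rows to fill a processing buffer.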
## Key Takeaways

- Use the `chunksize` parameter in `pd.read_csv()` to read large CSV files in smaller parts.
- Specify data types and select only needed columns to reduce memory usage.
- Avoid loading the entire large file at once to prevent memory errors.
- Process each chunk separately to handle data efficiently.
- Use `iterator=True` if you want manual control over chunk reading.