# How to Read Large CSV Files Efficiently in pandas
To read large CSV files in pandas, use the `pd.read_csv()` function with the `chunksize` parameter to load the file in smaller parts. This approach keeps memory usage manageable by processing the file piece by piece instead of loading it all at once.

## Syntax
The main function to read CSV files in pandas is `pd.read_csv()`. To handle large files, you can use the `chunksize` parameter, which reads the file in smaller pieces called chunks.

- `filepath_or_buffer`: Path to the CSV file.
- `chunksize`: Number of rows per chunk to read at a time.
- `usecols`: Select specific columns to reduce memory usage.
- `dtype`: Specify data types to optimize memory.

```python
pd.read_csv(filepath_or_buffer, chunksize=10000, usecols=None, dtype=None)
```
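To see how `usecols` and `dtype` cut memory before any chunking is involved, here is a minimal sketch. The file contents and column names (`col1`, `col2`, `col3`) are hypothetical; an in-memory buffer stands in for a large file on disk.

```python
import io
import pandas as pd

# Small in-memory CSV standing in for a large file on disk (hypothetical data)
csv_data = io.StringIO("col1,col2,col3\n1,a,x\n2,b,y\n3,c,z\n")

# Load only the columns you need, with a compact integer dtype
df = pd.read_csv(csv_data, usecols=['col1', 'col2'], dtype={'col1': 'int32'})

print(df.dtypes['col1'])   # int32 instead of the default int64
print(list(df.columns))    # col3 was never loaded
```

Skipping unused columns at read time is cheaper than dropping them afterwards, because pandas never allocates memory for them.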
## Example
This example shows how to read a large CSV file in chunks of 5000 rows, process each chunk by counting rows, and combine results.
```python
import pandas as pd

chunk_size = 5000
row_count = 0
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    row_count += len(chunk)

print(f'Total rows read: {row_count}')
```
## Output
Total rows read: 25000
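Counting rows is the simplest case; often you want to filter or transform each chunk and combine the surviving rows with `pd.concat`. The sketch below uses a synthetic in-memory CSV (a `value` column with the numbers 0 through 99) in place of a real large file.

```python
import io
import pandas as pd

# Synthetic stand-in for a large CSV file (hypothetical data)
csv_text = "value\n" + "\n".join(str(i) for i in range(100))

# Filter each chunk, then combine the kept rows at the end
parts = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=25):
    parts.append(chunk[chunk['value'] >= 90])

result = pd.concat(parts, ignore_index=True)
print(len(result))  # 10 rows survive the filter
```

Collecting filtered chunks in a list and concatenating once at the end is much faster than appending to a DataFrame inside the loop, and the combined result stays small as long as the filter is selective.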
## Common Pitfalls
Common mistakes when reading large CSV files include:
- Trying to load the entire file at once, causing memory errors.
- Not specifying `chunksize` when the file is too large.
- Ignoring data types, which can increase memory usage.
Always use `chunksize` for large files, and consider specifying `dtype` and `usecols` to reduce memory usage.
```python
import pandas as pd

# Wrong: loading the entire large file at once (may crash)
# df = pd.read_csv('large_file.csv')

# Right: read in chunks
chunk_size = 10000
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    # process each chunk
    pass
```
## Quick Reference
| Parameter | Description | Example |
|---|---|---|
| `chunksize` | Number of rows per chunk to read | `chunksize=5000` |
| `usecols` | Select specific columns to load | `usecols=['col1', 'col2']` |
| `dtype` | Specify data types for columns | `dtype={'col1': 'int32'}` |
| `iterator` | Return an iterator for manual chunk handling | `iterator=True` |
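With `iterator=True`, `pd.read_csv()` returns a `TextFileReader` whose `get_chunk(n)` method pulls rows on demand, so chunk sizes can vary from call to call. A minimal sketch, again using a synthetic in-memory CSV (a single column `a` with ten rows) in place of a real file:

```python
import io
import pandas as pd

# Synthetic ten-row CSV standing in for a large file (hypothetical data)
csv_text = "a\n" + "\n".join(str(i) for i in range(10))

# iterator=True gives manual control instead of a fixed-size loop
reader = pd.read_csv(io.StringIO(csv_text), iterator=True)
first = reader.get_chunk(4)   # pull the first 4 rows
second = reader.get_chunk(4)  # pull the next 4 rows
print(len(first), len(second))  # 4 4
```

This is useful when the amount of data you need per step depends on what you have already seen, e.g. reading just enough rows to fill a processing buffer.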
## Key Takeaways

- Use the `chunksize` parameter in `pd.read_csv()` to read large CSV files in smaller parts.
- Specify data types and select only needed columns to reduce memory usage.
- Avoid loading the entire large file at once to prevent memory errors.
- Process each chunk separately to handle data efficiently.
- Use `iterator=True` if you want manual control over chunk reading.