How to Use chunksize in read_csv in pandas for Large Files
Use the chunksize parameter in pandas.read_csv() to read a large CSV file in smaller parts called chunks. It returns an iterator that yields DataFrames of the specified size, allowing you to process data piece by piece without loading the entire file into memory.

Syntax
The chunksize parameter in pandas.read_csv() specifies the number of rows per chunk to read from the CSV file. When set, read_csv returns an iterator instead of a single DataFrame.
- filepath_or_buffer: Path to the CSV file.
- chunksize: Number of rows per chunk (integer).
- Other parameters: Same as the usual read_csv options.
```python
pandas.read_csv(filepath_or_buffer, chunksize=number_of_rows, **kwargs)
```
Example
This example shows how to read a CSV file in chunks of 3 rows each and print each chunk separately. This helps when working with large files that don't fit in memory.
```python
import pandas as pd

# Create a sample CSV file
csv_data = '''name,age,city
Alice,30,New York
Bob,25,Los Angeles
Charlie,35,Chicago
David,40,Houston
Eva,28,Phoenix'''

with open('sample.csv', 'w') as f:
    f.write(csv_data)

# Read CSV in chunks of 3 rows
chunk_iter = pd.read_csv('sample.csv', chunksize=3)
for i, chunk in enumerate(chunk_iter, 1):
    print(f"Chunk {i}:")
    print(chunk)
    print()
```
Output
```
Chunk 1:
      name  age         city
0    Alice   30     New York
1      Bob   25  Los Angeles
2  Charlie   35      Chicago

Chunk 2:
    name  age     city
3  David   40  Houston
4    Eva   28  Phoenix
```
Common Pitfalls
- Forgetting that read_csv with chunksize returns an iterator, not a DataFrame.
- Trying to use DataFrame methods directly on the iterator without looping over chunks.
- Not closing or exhausting the iterator, which can cause resource warnings.
- Setting chunksize too small or too large, which affects performance.
Always loop over the chunks to process data piece by piece.
```python
import pandas as pd

# Wrong: Trying to print the iterator directly
chunk_iter = pd.read_csv('sample.csv', chunksize=3)
print(chunk_iter)  # This prints the iterator object, not data

# Right: Loop over chunks to access data
for chunk in chunk_iter:
    print(chunk)
```
Output
Text output:

```
<pandas.io.parsers.TextFileReader object at 0x...>
```

Then each chunk DataFrame is printed correctly.
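The pitfall about not closing the iterator is easiest to avoid with a context manager: the reader returned by read_csv supports the with statement (in pandas 1.2 and later), which closes the underlying file handle automatically when the block exits. A minimal sketch, recreating the sample.csv file from the earlier example so it runs on its own:

```python
import pandas as pd

# Recreate the sample file so this snippet is self-contained
with open('sample.csv', 'w') as f:
    f.write('name,age,city\nAlice,30,New York\nBob,25,Los Angeles\n'
            'Charlie,35,Chicago\nDavid,40,Houston\nEva,28,Phoenix\n')

# The reader closes its file handle when the with-block exits,
# even if an exception is raised while processing a chunk
with pd.read_csv('sample.csv', chunksize=2) as reader:
    for chunk in reader:
        print(len(chunk), "rows")
```

With 5 rows and chunksize=2, the loop sees chunks of 2, 2, and 1 rows.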
Quick Reference
| Parameter | Description | Example |
|---|---|---|
| filepath_or_buffer | Path or URL of the CSV file | 'data.csv' |
| chunksize | Number of rows per chunk to read | 1000 |
| iterator | Returns an iterator if True (default False) | True |
| usecols | Select specific columns to read | ['name', 'age'] |
| dtype | Specify data types for columns | {'age': int} |
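The parameters in the table combine naturally: usecols and dtype apply to every chunk, so each chunk stays small and consistently typed. A short sketch, again assuming the sample.csv file created in the first example:

```python
import pandas as pd

# Recreate the sample file so this snippet is self-contained
with open('sample.csv', 'w') as f:
    f.write('name,age,city\nAlice,30,New York\nBob,25,Los Angeles\n'
            'Charlie,35,Chicago\nDavid,40,Houston\nEva,28,Phoenix\n')

# Read only two columns, with an explicit dtype, in chunks of 2 rows
for chunk in pd.read_csv('sample.csv', chunksize=2,
                         usecols=['name', 'age'], dtype={'age': 'int64'}):
    print(list(chunk.columns), chunk['age'].dtype)
```

Dropping unneeded columns with usecols reduces the memory footprint of each chunk on top of what chunking alone saves.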
Key Takeaways
- Use chunksize in read_csv to read large CSV files in smaller parts without loading all data at once.
- read_csv with chunksize returns an iterator that yields DataFrames of the specified row size.
- Always loop over the chunks to process or analyze data piece by piece.
- Avoid treating the iterator as a DataFrame directly to prevent errors.
- Choose chunksize based on memory limits and processing needs for best performance.
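As a concrete illustration of processing piece by piece, the usual pattern is to compute a partial result per chunk and combine the partials, so memory usage stays bounded by the chunk size rather than the file size. A minimal sketch, assuming the same sample.csv used throughout:

```python
import pandas as pd

# Recreate the sample file so this snippet is self-contained
with open('sample.csv', 'w') as f:
    f.write('name,age,city\nAlice,30,New York\nBob,25,Los Angeles\n'
            'Charlie,35,Chicago\nDavid,40,Houston\nEva,28,Phoenix\n')

# Accumulate a running sum and count instead of holding all rows at once
total_age, total_rows = 0, 0
for chunk in pd.read_csv('sample.csv', chunksize=2):
    total_age += chunk['age'].sum()
    total_rows += len(chunk)

print("mean age:", total_age / total_rows)  # → mean age: 31.6
```

The same shape works for any aggregate that can be built from per-chunk pieces (sums, counts, min/max, groupby partials combined at the end).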