
How to Use chunksize in read_csv in pandas for Large Files

Use the chunksize parameter in pandas.read_csv() to read a large CSV file in smaller parts called chunks. It returns an iterator that yields DataFrames with up to the specified number of rows (the final chunk may be smaller), allowing you to process data piece by piece without loading the entire file into memory.
📐

Syntax

The chunksize parameter in pandas.read_csv() specifies the number of rows per chunk to read from the CSV file. When set, read_csv returns an iterator instead of a single DataFrame.

  • filepath_or_buffer: Path to the CSV file.
  • chunksize: Number of rows per chunk (integer).
  • Other parameters: Same as usual read_csv options.
python
pandas.read_csv(filepath_or_buffer, chunksize=number_of_rows, **kwargs)
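With chunksize set, the call returns a TextFileReader rather than a DataFrame. A minimal sketch (the file name tiny.csv is just an illustration) showing the return type, pulling a single chunk with get_chunk(), and closing the reader when you stop early:

```python
import pandas as pd

# Write a tiny CSV so the sketch is self-contained
pd.DataFrame({"x": range(5)}).to_csv("tiny.csv", index=False)

reader = pd.read_csv("tiny.csv", chunksize=2)
print(type(reader).__name__)  # the reader is a TextFileReader, not a DataFrame

first = reader.get_chunk()    # pull one chunk of up to 2 rows
print(len(first))             # 2

reader.close()                # release the file handle if you stop early
```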
💻

Example

This example shows how to read a CSV file in chunks of 3 rows each and print each chunk separately. This helps when working with large files that don't fit in memory.

python
import pandas as pd

# Create a sample CSV file
csv_data = '''name,age,city
Alice,30,New York
Bob,25,Los Angeles
Charlie,35,Chicago
David,40,Houston
Eva,28,Phoenix'''

with open('sample.csv', 'w') as f:
    f.write(csv_data)

# Read CSV in chunks of 3 rows
chunk_iter = pd.read_csv('sample.csv', chunksize=3)

for i, chunk in enumerate(chunk_iter, 1):
    print(f"Chunk {i}:")
    print(chunk)
    print()
Output
Chunk 1:
      name  age         city
0    Alice   30     New York
1      Bob   25  Los Angeles
2  Charlie   35      Chicago

Chunk 2:
    name  age     city
3  David   40  Houston
4    Eva   28  Phoenix
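In practice you usually aggregate across chunks rather than just print them. This sketch recreates the same sample.csv (so the snippet runs on its own) and sums the age column chunk by chunk, holding only one chunk in memory at a time:

```python
import pandas as pd

# Recreate the sample file so this snippet runs on its own
pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "age": [30, 25, 35, 40, 28],
    "city": ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix"],
}).to_csv("sample.csv", index=False)

# Aggregate chunk by chunk: only one chunk is in memory at a time
total_rows = 0
age_sum = 0
for chunk in pd.read_csv("sample.csv", chunksize=2):
    total_rows += len(chunk)
    age_sum += chunk["age"].sum()

print(f"rows={total_rows}, mean age={age_sum / total_rows}")  # rows=5, mean age=31.6
```

The same pattern works for any per-chunk reduction (counts, sums, filters appended to a list and concatenated at the end).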
⚠️

Common Pitfalls

  • Forgetting that read_csv with chunksize returns an iterator, not a DataFrame.
  • Trying to use DataFrame methods directly on the iterator without looping over chunks.
  • Not closing or exhausting the iterator, which can cause resource warnings.
  • Setting chunksize too small or too large, which affects performance.

Always loop over the chunks to process data piece by piece.

python
import pandas as pd

# Wrong: Trying to print the iterator directly
chunk_iter = pd.read_csv('sample.csv', chunksize=3)
print(chunk_iter)  # This prints the iterator object, not data

# Right: Loop over chunks to access data
for chunk in chunk_iter:
    print(chunk)
Output
<pandas.io.parsers.TextFileReader object at 0x...>
(followed by each chunk DataFrame printed correctly)
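To address the "not closing the iterator" pitfall, the reader can be used as a context manager (supported since pandas 1.2), which guarantees the underlying file handle is closed even if processing raises an error. A sketch using a throwaway nums.csv:

```python
import pandas as pd

# Small file for the demonstration
pd.DataFrame({"a": range(6)}).to_csv("nums.csv", index=False)

# The with-block closes the reader automatically, even on errors
sums = []
with pd.read_csv("nums.csv", chunksize=2) as reader:
    for chunk in reader:
        sums.append(int(chunk["a"].sum()))

print(sums)  # [1, 5, 9]
```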
📊

Quick Reference

Parameter           Description                                  Example
filepath_or_buffer  Path or URL of the CSV file                  'data.csv'
chunksize           Number of rows per chunk to read             1000
iterator            Returns an iterator if True (default False)  True
usecols             Select specific columns to read              ['name', 'age']
dtype               Specify data types for columns               {'age': int}
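The parameters above compose. A short sketch combining chunksize with usecols and dtype, so each chunk carries only the needed columns, already parsed to the requested type:

```python
import pandas as pd

# Sample file so the snippet runs on its own
pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [30, 25, 35],
    "city": ["New York", "Los Angeles", "Chicago"],
}).to_csv("sample.csv", index=False)

# Each chunk has only name and age, with age parsed as int32
cols = None
for chunk in pd.read_csv(
    "sample.csv",
    chunksize=2,
    usecols=["name", "age"],
    dtype={"age": "int32"},
):
    cols = list(chunk.columns)
    print(cols, chunk["age"].dtype)  # ['name', 'age'] int32
```

Trimming columns and types per chunk keeps memory use low even before you start aggregating.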

Key Takeaways

  • Use chunksize in read_csv to read large CSV files in smaller parts without loading all data at once.
  • read_csv with chunksize returns an iterator that yields DataFrames of the specified row size.
  • Always loop over the chunks to process or analyze data piece by piece.
  • Avoid treating the iterator as a DataFrame directly to prevent errors.
  • Choose chunksize based on memory limits and processing needs for best performance.