
Chunked reading for large files in Pandas

Introduction

Sometimes a file is too large to load into memory all at once. Chunked reading lets us read a big file piece by piece without running out of memory.

When your computer runs out of memory loading a big file.
When you want to process a large file step-by-step.
When you want to analyze or clean data in parts.
When you want to keep memory usage low by holding only a small piece of the file in memory at a time.
Syntax
Pandas
pd.read_csv('filename.csv', chunksize=number)

The chunksize parameter sets how many rows to read at a time.

Instead of a DataFrame, this returns an iterator you can loop over; each iteration yields one chunk as a DataFrame.
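A minimal sketch of what that iterator looks like in practice (the filename tiny.csv and its contents are just an illustration):

```python
import pandas as pd

# Build a tiny 7-row CSV so the sketch is self-contained.
pd.DataFrame({'x': range(7)}).to_csv('tiny.csv', index=False)

# With chunksize set, read_csv returns an iterator, not a DataFrame.
# Using it as a context manager makes sure the file gets closed.
with pd.read_csv('tiny.csv', chunksize=3) as reader:
    sizes = [len(chunk) for chunk in reader]

print(sizes)  # the last chunk may be smaller than chunksize
```

Note that the final chunk holds whatever rows are left over, so it can be shorter than chunksize.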

Examples
Read the file in chunks of 1000 rows and print the first 5 rows of each chunk.
Pandas
chunks = pd.read_csv('data.csv', chunksize=1000)
for chunk in chunks:
    print(chunk.head())
Read the file in chunks of 5000 rows and store all chunks in a list, then print how many chunks were read.
Pandas
chunks = pd.read_csv('data.csv', chunksize=5000)
chunk_list = [chunk for chunk in chunks]
print(len(chunk_list))
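Once the chunks are in a list, they can be stitched back into a single DataFrame with pd.concat. A self-contained sketch (the file name and column are stand-ins for your own data):

```python
import pandas as pd

# Create a small stand-in file so the sketch runs on its own.
pd.DataFrame({'value': range(10)}).to_csv('data.csv', index=False)

# Read in chunks of 4 rows and collect them in a list.
chunks = pd.read_csv('data.csv', chunksize=4)
chunk_list = [chunk for chunk in chunks]

# pd.concat glues the chunk DataFrames back together;
# ignore_index=True renumbers the rows 0..n-1.
combined = pd.concat(chunk_list, ignore_index=True)
print(len(combined))  # same number of rows as the original file
```

Collecting every chunk like this loads the whole file into memory in the end, so it only makes sense when you shrink or filter the chunks first.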
Sample Program

This program creates a large CSV file with 10,000 rows, then reads it in chunks of 3,000 rows. It sums the 'value' column in each chunk and adds them up to get the total sum.

Pandas
import pandas as pd

# Create a sample CSV file with 10,000 rows
sample_data = pd.DataFrame({
    'id': range(1, 10001),
    'value': [x * 2 for x in range(1, 10001)]
})
sample_data.to_csv('large_sample.csv', index=False)

# Read the CSV file in chunks of 3000 rows
chunks = pd.read_csv('large_sample.csv', chunksize=3000)

# Process each chunk: calculate sum of 'value' column
total_sum = 0
for chunk in chunks:
    total_sum += chunk['value'].sum()

print(f'Total sum of value column: {total_sum}')
Output
Total sum of value column: 100010000
Important Notes

Using chunked reading helps avoid memory errors with big files.

You can process or filter each chunk before combining results.

The chunk iterator keeps the file open; use it as a context manager (with pd.read_csv(..., chunksize=...) as reader:) or call its close() method if you stop iterating early.
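As a sketch of filtering each chunk before combining the results (the file numbers.csv, the value column, and the threshold are all illustrative):

```python
import pandas as pd

# Sample file: 100 rows with a 'value' column.
pd.DataFrame({'value': range(100)}).to_csv('numbers.csv', index=False)

filtered_parts = []
with pd.read_csv('numbers.csv', chunksize=25) as reader:
    for chunk in reader:
        # Keep only the rows we care about; each filtered piece is small.
        filtered_parts.append(chunk[chunk['value'] >= 90])

result = pd.concat(filtered_parts, ignore_index=True)
print(len(result))  # only the filtered rows reach memory at the end
```

Because each chunk is filtered before it is kept, the final DataFrame stays small even when the source file is huge.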

Summary

Chunked reading lets you handle big files in small parts.

Use chunksize in pd.read_csv to read pieces.

Process each chunk separately to keep memory usage low.