How to Process Large Datasets Efficiently with pandas
To process large datasets in pandas, use read_csv() with the chunksize parameter to load the data in smaller parts. Also, optimize memory by converting columns to efficient data types, and filter data early to reduce its size.

Syntax
Use pandas.read_csv() with chunksize to read large files in parts. Convert columns with astype() to save memory. Filter data with boolean indexing to reduce size before processing.
- `pd.read_csv(filepath, chunksize=n)`: reads the file in chunks of `n` rows.
- `df.astype(dtype)`: changes a column's data type.
- `df[df['col'] > value]`: filters rows based on a condition.
```python
import pandas as pd

# Read CSV in chunks of 10000 rows
chunk_iter = pd.read_csv('large_file.csv', chunksize=10000)

# Assuming df is a DataFrame loaded from a chunk or file

# Convert column to category type for memory efficiency
df['category_col'] = df['category_col'].astype('category')

# Filter rows where column 'age' > 30
df_filtered = df[df['age'] > 30]
```
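Efficient types can also be applied while the file is being parsed, via read_csv's dtype parameter, so each chunk arrives already compact instead of being converted afterwards. A minimal sketch, using a small in-memory CSV (the column names here are hypothetical):

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk
csv_data = io.StringIO("age,score\n34,88.5\n28,92.0\n45,79.5\n")

# dtype= applies compact types during parsing, so every chunk
# is memory-efficient from the start
chunks = pd.read_csv(csv_data, chunksize=2,
                     dtype={'age': 'int16', 'score': 'float32'})
df = pd.concat(chunks)
print(df.dtypes['age'])    # int16
print(df.dtypes['score'])  # float32
```

This avoids the brief memory spike you get when a chunk is first loaded with wide default types (int64/float64) and only then downcast.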
Example
This example shows how to read a CSV file in chunks, filter rows where the value in column 'Height(Inches)' is above 65, and combine the filtered chunks into one DataFrame.
```python
import pandas as pd

chunks = []
for chunk in pd.read_csv('https://people.sc.fsu.edu/~jburkardt/data/csv/hw_200.csv', chunksize=50):
    filtered = chunk[chunk['Height(Inches)'] > 65]
    chunks.append(filtered)

result = pd.concat(chunks)
print(result.head())
Output
```
   Index  Height(Inches)  Weight(Pounds)
2      3            66.5           112.0
3      4            69.0           120.0
4      5            69.0           150.0
5      6            70.0           145.0
6      7            72.0           171.0
```
Common Pitfalls
Trying to load a very large dataset all at once can exhaust your computer's memory. Leaving columns at their default data types wastes memory, and forgetting to filter early means you process unnecessary rows, slowing down your work.
Wrong way: loading entire file at once.
```python
import pandas as pd

# Wrong: loads the entire file at once; may crash if the file is huge
df = pd.read_csv('large_file.csv')

# Right: load in chunks and process piece by piece
chunks = []
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
    filtered = chunk[chunk['value'] > 100]
    chunks.append(filtered)
df_filtered = pd.concat(chunks)
```
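To see how much the data-type pitfall costs, compare memory usage before and after converting a repetitive string column to `category`. This sketch uses a hypothetical DataFrame; memory_usage(deep=True) counts the actual bytes held by Python string objects:

```python
import pandas as pd

# Hypothetical column with many repeated string values
df = pd.DataFrame({'city': ['NYC', 'LA', 'Chicago'] * 10_000})

before = df['city'].memory_usage(deep=True)

# Store each distinct string once; rows become small integer codes
df['city'] = df['city'].astype('category')
after = df['city'].memory_usage(deep=True)

print(f"object: {before:,} bytes -> category: {after:,} bytes")
```

The savings grow with the ratio of rows to distinct values, so `category` pays off most on low-cardinality columns.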
Quick Reference
- Use `chunksize` in `read_csv` to load data in parts.
- Convert columns to `category` or smaller numeric types to save memory.
- Filter data early to reduce size before heavy processing.
- Use `pd.concat()` to combine processed chunks.
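For the "smaller numeric types" point, pd.to_numeric with the downcast argument picks the narrowest type that still fits the values, so you don't have to choose one by hand. A short sketch with hypothetical values:

```python
import pandas as pd

# Integers are read as int64 by default (8 bytes per value)
s = pd.Series([1, 5, 120, 42])

# downcast='integer' selects the smallest signed type that fits
small = pd.to_numeric(s, downcast='integer')
print(small.dtype)  # int8 for these values
```

Use `downcast='float'` for float columns and `downcast='unsigned'` when all values are non-negative.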
Key Takeaways
- Use `chunksize` in `read_csv` to handle large files in smaller parts.
- Convert columns to efficient data types like `category` to reduce memory use.
- Filter data early to avoid processing unnecessary rows.
- Combine processed chunks with `pd.concat()` for final analysis.