How to Process Large Datasets Efficiently with pandas
To process large datasets in pandas, use read_csv() with the chunksize parameter to load the data in smaller parts. Also, optimize memory by converting columns to efficient data types, and filter data early to reduce its size.

Syntax
Use pandas.read_csv() with chunksize to read large files in parts. Convert columns with astype() to save memory. Filter data with boolean indexing to reduce size before processing.
- `pd.read_csv(filepath, chunksize=n)`: reads the file in chunks of `n` rows.
- `df.astype(dtype)`: changes a column's data type.
- `df[df['col'] > value]`: filters rows based on a condition.
```python
import pandas as pd

# Read CSV in chunks of 10000 rows
chunk_iter = pd.read_csv('large_file.csv', chunksize=10000)

# Assuming df is a DataFrame loaded from a chunk or file

# Convert column to category type for memory efficiency
df['category_col'] = df['category_col'].astype('category')

# Filter rows where column 'age' > 30
df_filtered = df[df['age'] > 30]
```
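Efficient types can also be applied while the file is being parsed, via read_csv's dtype parameter, so each chunk arrives already compact instead of being converted afterwards. A minimal sketch, using a small in-memory CSV (the column names here are hypothetical):

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk
csv_data = io.StringIO("age,score\n34,88.5\n28,92.0\n45,79.5\n")

# dtype= applies compact types during parsing, so every chunk
# is memory-efficient from the start
chunks = pd.read_csv(csv_data, chunksize=2,
                     dtype={'age': 'int16', 'score': 'float32'})
df = pd.concat(chunks)
print(df.dtypes['age'])    # int16
print(df.dtypes['score'])  # float32
```

This avoids the brief memory spike you get when a chunk is first loaded with wide default types (int64/float64) and only then downcast.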
Example
This example shows how to read a CSV file in chunks, filter rows where the value in column 'Height(Inches)' is above 65, and combine the filtered chunks into one DataFrame.
```python
import pandas as pd

chunks = []
for chunk in pd.read_csv('https://people.sc.fsu.edu/~jburkardt/data/csv/hw_200.csv', chunksize=50):
    filtered = chunk[chunk['Height(Inches)'] > 65]
    chunks.append(filtered)

result = pd.concat(chunks)
print(result.head())
Output
```
   Index  Height(Inches)  Weight(Pounds)
2      3            66.5           112.0
3      4            69.0           120.0
4      5            69.0           150.0
5      6            70.0           145.0
6      7            72.0           171.0
```
Common Pitfalls
Trying to load a very large dataset all at once can exhaust your computer's memory. Leaving columns at their default data types wastes memory, and forgetting to filter early means you process unnecessary rows, slowing down your work.
Wrong way: loading entire file at once.
```python
import pandas as pd

# Wrong: loads the entire file at once; may crash if the file is huge
df = pd.read_csv('large_file.csv')

# Right: load in chunks and process piece by piece
chunks = []
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
    filtered = chunk[chunk['value'] > 100]
    chunks.append(filtered)
df_filtered = pd.concat(chunks)
```
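To see how much the data-type pitfall costs, compare memory usage before and after converting a repetitive string column to `category`. This sketch uses a hypothetical DataFrame; memory_usage(deep=True) counts the actual bytes held by Python string objects:

```python
import pandas as pd

# Hypothetical column with many repeated string values
df = pd.DataFrame({'city': ['NYC', 'LA', 'Chicago'] * 10_000})

before = df['city'].memory_usage(deep=True)

# Store each distinct string once; rows become small integer codes
df['city'] = df['city'].astype('category')
after = df['city'].memory_usage(deep=True)

print(f"object: {before:,} bytes -> category: {after:,} bytes")
```

The savings grow with the ratio of rows to distinct values, so `category` pays off most on low-cardinality columns.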
Quick Reference
- Use `chunksize` in `read_csv` to load data in parts.
- Convert columns to `category` or smaller numeric types to save memory.
- Filter data early to reduce size before heavy processing.
- Use `pd.concat()` to combine processed chunks.
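For the "smaller numeric types" point, pd.to_numeric with the downcast argument picks the narrowest type that still fits the values, so you don't have to choose one by hand. A short sketch with hypothetical values:

```python
import pandas as pd

# Integers are read as int64 by default (8 bytes per value)
s = pd.Series([1, 5, 120, 42])

# downcast='integer' selects the smallest signed type that fits
small = pd.to_numeric(s, downcast='integer')
print(small.dtype)  # int8 for these values
```

Use `downcast='float'` for float columns and `downcast='unsigned'` when all values are non-negative.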
Key Takeaways
- Use `chunksize` in `read_csv` to handle large files in smaller parts.
- Convert columns to efficient data types like `category` to reduce memory use.
- Filter data early to avoid processing unnecessary rows.
- Combine processed chunks with `pd.concat()` for final analysis.