
Strategies for working with large datasets in Pandas

Introduction

Large datasets can be slow to process and easy to mishandle. A few simple strategies help you work faster and avoid memory errors.

When your data file is too big to open in memory all at once.
When you want to speed up data loading and processing.
When you need to save memory while analyzing data.
When you want to process data in smaller parts instead of all at once.
When you want to avoid crashing your computer due to too much data.
Syntax
Pandas
import pandas as pd

# Read data in chunks
for chunk in pd.read_csv('data.csv', chunksize=10000):
    # process each chunk
    pass

# Use specific columns to reduce memory
df = pd.read_csv('data.csv', usecols=['col1', 'col2'])

# Convert data types to save memory
df['col1'] = df['col1'].astype('category')

Reading data in chunks lets you handle big files bit by bit.

Choosing only needed columns and changing data types saves memory.
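As a concrete sketch combining both ideas, the snippet below builds a small demo file ('demo.csv' is a made-up name for illustration), then compares a default load against one that selects only the needed columns and assigns leaner dtypes up front via read_csv's dtype parameter:

```python
import pandas as pd

# Build a small demo file (hypothetical name 'demo.csv') to measure against
pd.DataFrame({
    'col1': ['x', 'y'] * 500,                # repetitive strings
    'col2': range(1000),                     # small integers
    'col3': [i * 0.5 for i in range(1000)],  # a column we won't need
}).to_csv('demo.csv', index=False)

# Default load: pandas infers object/int64/float64 for everything
df_default = pd.read_csv('demo.csv')

# Leaner load: only the needed columns, with smaller dtypes chosen up front
df_lean = pd.read_csv(
    'demo.csv',
    usecols=['col1', 'col2'],
    dtype={'col1': 'category', 'col2': 'int32'},
)

print('default:', df_default.memory_usage(deep=True).sum(), 'bytes')
print('lean:   ', df_lean.memory_usage(deep=True).sum(), 'bytes')
```

Setting dtypes in read_csv itself (rather than converting afterwards) means the full-width columns never occupy memory in the first place.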

Examples
This reads the big file in small parts and prints the shape (rows, columns) of each part.
Pandas
import pandas as pd

# Read CSV file in chunks of 5000 rows
for chunk in pd.read_csv('big_data.csv', chunksize=5000):
    print(chunk.shape)
This loads only the 'Name' and 'Age' columns from the file.
Pandas
import pandas as pd

# Load only two columns to save memory
df = pd.read_csv('big_data.csv', usecols=['Name', 'Age'])
print(df.head())
Changing a column to 'category' type reduces memory if it has repeated values.
Pandas
import pandas as pd

df = pd.read_csv('big_data.csv')
# Change 'Category' column to category type
if 'Category' in df.columns:
    df['Category'] = df['Category'].astype('category')
df.info()  # info() prints its summary directly; wrapping it in print() adds a stray 'None'
Sample Program

This program shows three strategies: reading data in chunks, loading only needed columns, and converting a column to a category type to save memory.

Pandas
import pandas as pd

# Simulate reading a large CSV in chunks
# Here we create a sample CSV first
sample_data = pd.DataFrame({
    'ID': range(1, 20001),
    'Value': ['A']*10000 + ['B']*10000,
    'Number': range(20000, 0, -1)
})
sample_data.to_csv('sample_large.csv', index=False)

# Process the CSV in chunks and count values
value_counts = {}
for chunk in pd.read_csv('sample_large.csv', chunksize=5000):
    counts = chunk['Value'].value_counts()
    for val, count in counts.items():
        value_counts[val] = value_counts.get(val, 0) + count

print('Value counts from chunks:')
print(value_counts)

# Load only 'ID' and 'Number' columns to save memory
df_small = pd.read_csv('sample_large.csv', usecols=['ID', 'Number'])
print('\nDataFrame with selected columns:')
print(df_small.head())

# Convert 'Value' column to category type
df_full = pd.read_csv('sample_large.csv')
df_full['Value'] = df_full['Value'].astype('category')
print('\nMemory usage after converting to category:')
print(df_full.memory_usage(deep=True))
Important Notes

Always test chunk size to find the best balance between speed and memory.
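One rough way to test this is to time the same aggregation at several chunk sizes. The sketch below writes a throwaway file ('tmp_chunks.csv' is a made-up name) and compares:

```python
import time
import pandas as pd

# Throwaway file to read back in chunks
pd.DataFrame({'n': range(100_000)}).to_csv('tmp_chunks.csv', index=False)

for size in (1_000, 10_000, 50_000):
    start = time.perf_counter()
    total = sum(chunk['n'].sum() for chunk in
                pd.read_csv('tmp_chunks.csv', chunksize=size))
    elapsed = time.perf_counter() - start
    print(f'chunksize={size}: total={total}, {elapsed:.4f}s')
```

The total is identical at every chunk size; only speed and peak memory differ. Smaller chunks use less memory but add per-chunk overhead, so the sweet spot depends on your machine and file.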

Using 'category' type works best for columns with few unique values.
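To see why, compare a repetitive column stored as plain object strings against the same column converted to 'category':

```python
import pandas as pd

# 30,000 values but only 3 unique ones: an ideal case for 'category'
s_obj = pd.Series(['red', 'green', 'blue'] * 10_000)
s_cat = s_obj.astype('category')

obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
print(f'object:   {obj_bytes} bytes')
print(f'category: {cat_bytes} bytes')
```

A categorical column stores each unique value once plus a small integer code per row, so the savings shrink (and can even reverse) as the number of unique values approaches the number of rows.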

Loading only the columns you need reduces memory, but the omitted columns are not available later without re-reading the file.

Summary

Large datasets can be handled by reading in chunks, selecting columns, and changing data types.

These strategies help save memory and speed up processing.

Try these methods to work comfortably with big data on your computer.