
Working with large files efficiently in NumPy

Introduction

Some datasets are too large to fit in memory all at once. Working with them efficiently saves both memory and time.

You have a huge dataset that does not fit into your computer's memory.
You want to process data in parts instead of loading everything at once.
You need to read or write large numerical data quickly.
You want to avoid your program crashing due to memory overload.
You want to speed up data analysis by handling data in chunks.
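The chunked approach described above can be sketched as follows. This is a minimal example, not a fixed recipe: the file name 'demo_chunks.dat' is made up for illustration, and the demo file is created first so the snippet is self-contained; in practice the binary file would already exist and be far larger than memory.

```python
import numpy as np

# Demo setup: write a small binary file to process
# (a real file would already exist and be huge).
filename = 'demo_chunks.dat'
np.arange(1_000_000, dtype='float32').tofile(filename)

# Open the file as a memory map and process it one block at a time,
# so only a single block needs to be resident in memory at any moment.
data = np.memmap(filename, dtype='float32', mode='r', shape=(1_000_000,))
block = 100_000
total = 0.0
for start in range(0, data.shape[0], block):
    total += data[start:start + block].sum(dtype='float64')

print(total)  # sum of 0..999999 = 499999500000.0
```

The running total is accumulated across blocks, so the peak memory use is one block (here 100,000 floats) rather than the whole array.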
Syntax
NumPy
import numpy as np

# Load part of a large file using memory mapping
array = np.memmap('filename.dat', dtype='float32', mode='r', shape=(1000, 1000))

np.memmap lets you access small parts of a big file without loading it all.

You specify the data type, mode (read or write), and shape of the data.
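The mode parameter controls how the file is opened: 'r' for read-only, 'r+' for reading and writing an existing file, and 'w+' to create (or overwrite) a file. A small sketch of the difference, using a made-up file name 'modes.dat':

```python
import numpy as np

# 'w+' creates a new file (or overwrites an existing one).
arr = np.memmap('modes.dat', dtype='float32', mode='w+', shape=(10,))
arr[:] = 0
arr.flush()

# 'r+' reopens the existing file and allows in-place edits on disk.
upd = np.memmap('modes.dat', dtype='float32', mode='r+', shape=(10,))
upd[3] = 7.5
upd.flush()

# 'r' opens the file read-only; writes would raise an error.
check = np.memmap('modes.dat', dtype='float32', mode='r', shape=(10,))
print(check[3])  # 7.5
```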

Examples
This opens a large file as if it were an array, but only loads parts when needed.
NumPy
import numpy as np

# Memory-map a large binary file for reading
data = np.memmap('data.bin', dtype='float64', mode='r', shape=(5000, 5000))
This creates a new file on disk and writes the numbers 0 to 9999 into it through the memory map; for a truly huge file you would assign one slice at a time instead of building the whole array first.
NumPy
import numpy as np

# Create a new memory-mapped file for writing
mmap_array = np.memmap('newfile.dat', dtype='int32', mode='w+', shape=(10000,))
mmap_array[:] = np.arange(10000)
mmap_array.flush()  # make sure the data is written through to disk
For text files like CSV, read in smaller parts and convert to numpy arrays for processing.
NumPy
import numpy as np
import pandas as pd

# Read a large CSV file in chunks using pandas and convert to numpy arrays
chunks = pd.read_csv('large.csv', chunksize=10000)
for chunk in chunks:
    arr = chunk.to_numpy()
    # process arr here
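The "process arr here" step above usually means accumulating a result across chunks. Here is a minimal sketch that computes the mean of one column without ever holding the whole file in memory; the file 'demo.csv' and column 'x' are made up for illustration, and a small demo CSV is written first so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Build a tiny demo CSV (stand-in for a file too large to load at once).
pd.DataFrame({'x': np.arange(100)}).to_csv('demo.csv', index=False)

# Accumulate a sum and a count across chunks, then divide at the end,
# so only one chunk is in memory at a time.
total, count = 0.0, 0
for chunk in pd.read_csv('demo.csv', chunksize=25):
    arr = chunk['x'].to_numpy()
    total += arr.sum()
    count += arr.size

print(total / count)  # mean of 0..99 = 49.5
```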
Sample Program

This program creates a large file containing 1 million evenly spaced numbers between 0 and 1. It then reads just 10 of them back without loading the whole file into memory.

NumPy
import numpy as np

# Create a large memory-mapped file and write data
filename = 'large_data.dat'
size = 1000000  # 1 million elements

# Create file with zeros
mmap_array = np.memmap(filename, dtype='float32', mode='w+', shape=(size,))
mmap_array[:] = np.linspace(0, 1, size)

# Flush changes to disk
mmap_array.flush()

# Now read only a slice without loading entire file
mmap_read = np.memmap(filename, dtype='float32', mode='r', shape=(size,))
slice_data = mmap_read[100000:100010]

print(slice_data)
Important Notes

Memory mapping works best with binary files, not text files.

Always specify the correct data type and shape to avoid errors.

Use chunk reading for large text files like CSVs, then convert to numpy arrays.
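One way to avoid a wrong hard-coded shape, as the note above warns about, is to derive the element count from the file size and the dtype's item size. A minimal sketch, using a made-up file name 'sized.dat' (a small demo file is created first so the snippet is self-contained):

```python
import os
import numpy as np

# Demo setup: a binary file of 2500 float64 values (20000 bytes).
filename = 'sized.dat'
np.zeros(2500, dtype='float64').tofile(filename)

# Derive the element count from the file size instead of hard-coding it:
# each float64 takes itemsize (8) bytes.
dtype = np.dtype('float64')
n = os.path.getsize(filename) // dtype.itemsize
data = np.memmap(filename, dtype=dtype, mode='r', shape=(n,))

print(n)  # 2500
```

If the file size is not an exact multiple of the item size, that is a sign the dtype (or the file) is not what you expected.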

Summary

Use np.memmap to work with large binary files without loading all data.

Read or write data in parts to save memory and speed up processing.

For large text files, read in chunks and convert to numpy arrays for analysis.