NumPy · data · ~15 mins

Memory-mapped files with np.memmap in NumPy - Deep Dive

Overview - Memory-mapped files with np.memmap
What is it?
Memory-mapped files with np.memmap allow you to work with large arrays stored on disk as if they were in memory. Instead of loading the entire file into RAM, only the parts you need are loaded on demand. This helps you handle data too big to fit in memory at once. You can read and write these files efficiently without using much RAM.
Why it matters
Without memory-mapped files, working with very large datasets would require loading everything into memory, which can crash programs or slow them down. Memory mapping solves this by letting you access data directly on disk, saving memory and often speeding up processing. This is crucial for big data, scientific computing, or any task with huge arrays.
Where it fits
Before learning np.memmap, you should understand basic numpy arrays and file input/output. After mastering memory mapping, you can explore advanced data handling techniques like chunking, out-of-core computation, and integration with databases or distributed systems.
Mental Model
Core Idea
Memory-mapped files let you treat parts of a large file on disk as if they were in memory, loading only what you need when you need it.
Think of it like...
It's like having a huge book on a shelf and only pulling out the pages you want to read instead of carrying the whole book around.
┌─────────────────────────────┐
│ Large file on disk          │
│ ┌───────────────────────┐ │
│ │ Memory-mapped array    │ │
│ │ ┌───────────────┐     │ │
│ │ │ Loaded pages  │◄────┤ │
│ │ └───────────────┘     │ │
│ └───────────────────────┘ │
└─────────────────────────────┘

Access only needed parts, not whole file.
Build-Up - 7 Steps
1. Foundation: Understanding NumPy array basics
Concept: Learn what NumPy arrays are and how they store data in memory.
NumPy arrays are like tables of numbers stored in a single block of memory. They let you do math fast because the data is stored contiguously. For example, np.array([1, 2, 3]) creates a small array in memory.
Result
You get a fast, efficient array stored fully in RAM.
Knowing how numpy arrays live in memory helps you understand why large arrays can cause memory problems.
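The in-RAM footprint is easy to inspect directly; a quick sketch:

```python
import numpy as np

# A small array lives entirely in RAM; nbytes reports the size of its
# data buffer: 3 elements * 8 bytes (int64) = 24 bytes.
a = np.array([1, 2, 3], dtype=np.int64)
print(a.nbytes)  # 24
```

Scale this up to a billion elements and the same buffer would need 8 GB of RAM, which is exactly the problem memory mapping addresses.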
2. Foundation: File input/output with NumPy
Concept: Learn how NumPy saves and loads arrays from disk using simple functions.
You can save arrays to files with np.save('file.npy', array) and load them back with np.load('file.npy'). This reads or writes the whole array at once.
Result
You can persist arrays on disk and reload them later.
Understanding basic file I/O shows the limitation: loading big arrays fully can use too much memory.
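A minimal round trip with np.save and np.load (the file path is illustrative):

```python
import numpy as np
import tempfile, os

# Illustrative path in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'demo.npy')
arr = np.arange(6, dtype=np.float32).reshape(2, 3)

# np.save writes the whole array to disk in one go...
np.save(path, arr)

# ...and np.load reads the whole array back into RAM at once.
loaded = np.load(path)
print(np.array_equal(arr, loaded))  # True
```

Both calls move the entire array at once, which is exactly the limitation the next step removes.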
3. Intermediate: Introducing np.memmap for large arrays
🤔Before reading on: do you think np.memmap loads the entire file into memory or only parts? Commit to your answer.
Concept: np.memmap creates an array-like object that accesses data on disk without loading it all into RAM.
Using np.memmap('file.dat', dtype='float32', mode='r+', shape=(1000,1000)) creates a memory-mapped array. You can read or write parts of it, and only those parts load into memory.
Result
You get an array interface to a file that uses little RAM even for huge data.
Understanding that np.memmap loads data lazily helps you handle datasets bigger than your RAM.
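A minimal sketch of creating a file-backed array (the path and shape are illustrative; mode='w+' is used so the example creates its own file):

```python
import numpy as np
import tempfile, os

# Illustrative path; 'w+' creates the file at the full mapped size.
path = os.path.join(tempfile.mkdtemp(), 'big.dat')
mm = np.memmap(path, dtype='float32', mode='w+', shape=(1000, 1000))

# The file is 1000 * 1000 * 4 bytes = 4 MB on disk, but slicing touches
# only the pages that back the requested region.
first_row = mm[0, :10]
print(first_row.shape)  # (10,)
```

The object behaves like an ordinary ndarray, so existing NumPy code mostly works on it unchanged.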
4. Intermediate: Reading and writing with np.memmap
🤔Before reading on: do you think writing to a np.memmap array changes the file immediately or only in memory? Commit to your answer.
Concept: Changes to a np.memmap array write directly to the file on disk, not just memory.
When you assign values to parts of a memmap array, those changes go to the file-backed pages and are persisted by the OS. For example, memmap_array[0, 0] = 42 updates the file either when the OS flushes the page or when you call flush() explicitly.
Result
You can modify large files efficiently without loading them fully.
Knowing that writes affect the file directly prevents confusion about data persistence.
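A sketch of the write path, including an explicit flush() (paths are illustrative):

```python
import numpy as np
import tempfile, os

# Illustrative path.
path = os.path.join(tempfile.mkdtemp(), 'data.dat')
mm = np.memmap(path, dtype='float32', mode='w+', shape=(100, 100))

# The assignment goes to the file-backed page, not a private copy.
mm[0, 0] = 42.0

# flush() asks the OS to write dirty pages out now rather than on its
# own schedule (pages are also flushed when the memmap is deleted).
mm.flush()

# Reopening the file read-only shows the persisted value.
reopened = np.memmap(path, dtype='float32', mode='r', shape=(100, 100))
print(reopened[0, 0])  # 42.0
```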
5. Intermediate: Modes and data types in np.memmap
Concept: Learn about different file access modes and importance of matching data types.
np.memmap supports modes like 'r' (read-only), 'r+' (read-write), and 'w+' (create new). The dtype and shape must match the file's data layout exactly to avoid errors.
Result
You can safely open files with correct permissions and data formats.
Understanding modes and dtypes avoids common bugs like data corruption or crashes.
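A sketch of the mode and layout rules (the path is illustrative). Reopening a file with a dtype that implies a larger layout than the file actually holds is rejected:

```python
import numpy as np
import tempfile, os

# Illustrative path; 'w+' creates a fresh 400-byte file (10*10 float32).
path = os.path.join(tempfile.mkdtemp(), 'modes.dat')
np.memmap(path, dtype='float32', mode='w+', shape=(10, 10)).flush()

# 'r' reopens it read-only; dtype and shape must describe the same layout.
ro = np.memmap(path, dtype='float32', mode='r', shape=(10, 10))

# float64 with the same shape would need 800 bytes; mapping more bytes
# than the file contains raises ValueError.
try:
    np.memmap(path, dtype='float64', mode='r', shape=(10, 10))
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)  # True
```

Note the asymmetry: in 'r' mode a size mismatch is an error, while 'w+' (and 'r+' on a short file) will size the file to fit the requested layout.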
6. Advanced: Performance and memory usage trade-offs
🤔Before reading on: do you think memory mapping always speeds up data access? Commit to your answer.
Concept: Memory mapping can speed up access but may be slower for random small reads compared to in-memory arrays.
Memory mapping uses the operating system's virtual memory system. Sequential reads are fast, but random access may cause delays due to disk I/O. Also, OS caching affects performance.
Result
You understand when memory mapping helps and when it might slow you down.
Knowing performance trade-offs helps you choose memory mapping wisely for your workload.
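The sequential case can be sketched as a chunked pass over a file-backed array, which plays to the strength of OS read-ahead (the path and sizes are illustrative):

```python
import numpy as np
import tempfile, os

# Illustrative path and sizes.
path = os.path.join(tempfile.mkdtemp(), 'seq.dat')
n = 1_000_000

w = np.memmap(path, dtype='float32', mode='w+', shape=(n,))
w[:] = 1.0
w.flush()

# Sequential, chunked passes let the OS stream consecutive pages from
# disk; scattered single-element reads would instead pay a page fault
# and possibly a disk seek per access.
mm = np.memmap(path, dtype='float32', mode='r', shape=(n,))
chunk = 100_000
total = 0.0
for start in range(0, n, chunk):
    total += float(mm[start:start + chunk].sum())
print(total)  # 1000000.0
```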
7. Expert: Internal OS handling of memory-mapped files
🤔Before reading on: do you think np.memmap manages disk reads itself or relies on the OS? Commit to your answer.
Concept: np.memmap relies on the operating system's virtual memory to load and cache file parts transparently.
When you access a memmap array, the OS loads the needed pages from disk into RAM. It also caches pages to speed up repeated access. This means np.memmap is a thin wrapper over OS memory management.
Result
You realize np.memmap's behavior depends heavily on OS paging and caching.
Understanding OS involvement explains why memory mapping performance varies and how to tune it.
Under the Hood
np.memmap creates a numpy array interface that points to a file on disk. It uses the OS's virtual memory system to map file contents into the process's address space. When you access parts of the array, the OS loads corresponding pages from disk into RAM. Writes update the file directly. This avoids loading the entire file and leverages OS caching.
Why designed this way?
Memory mapping was designed to handle large files efficiently without exhausting RAM. Using OS virtual memory avoids reinventing complex paging logic in Python. It also allows multiple processes to share data via the same file. Alternatives like loading full files or manual chunking are less efficient or more complex.
┌───────────────┐       ┌───────────────┐
│ np.memmap obj │──────▶│ OS virtual    │
│ (numpy array) │       │ memory system │
└───────────────┘       └───────────────┘
         │                        │
         │ Accesses pages         │ Loads pages
         ▼                        ▼
┌─────────────────────┐   ┌─────────────────┐
│ File on disk        │   │ RAM cache pages │
│ (raw data bytes)    │   │ (loaded pages)  │
└─────────────────────┘   └─────────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does np.memmap load the entire file into memory immediately? Commit yes or no.
Common Belief: np.memmap loads the whole file into RAM when opened.
Reality: np.memmap only loads parts of the file on demand, not the entire file.
Why it matters: Believing this causes unnecessary fear of memory use and may prevent using memory mapping when it would help.
Quick: Does writing to a np.memmap array only change memory until you save explicitly? Commit yes or no.
Common Belief: Changes to a memmap array are only in memory until you call a save function.
Reality: Writes to a memmap array update the file on disk immediately or when the OS flushes pages.
Why it matters: Misunderstanding this can cause data loss if you think changes are temporary and don't handle files properly.
Quick: Can you change the shape of a memmap array after creation? Commit yes or no.
Common Belief: You can reshape a memmap array to any shape after loading.
Reality: The shape must match the file's data layout exactly; reshaping arbitrarily can cause errors or corrupt data.
Why it matters: Incorrect reshaping leads to crashes or wrong data interpretation.
Quick: Does memory mapping always speed up data access compared to loading in RAM? Commit yes or no.
Common Belief: Memory mapping is always faster than loading data fully into memory.
Reality: Memory mapping can be slower for random small reads due to disk access overhead.
Why it matters: Overestimating speed can lead to poor performance if memory mapping is used inappropriately.
Expert Zone
1. Memory mapping performance depends heavily on the operating system's page cache and how it manages disk I/O.
2. Multiple processes can share the same memory-mapped file, enabling efficient inter-process communication without copying data.
3. The underlying file format and alignment affect how well memory mapping works; unaligned or compressed files are not suitable.
When NOT to use
Avoid np.memmap when working with compressed or encrypted files, or when random small reads dominate and speed is critical. Instead, use chunked loading, databases, or in-memory arrays if data fits.
Production Patterns
In production, np.memmap is used for large scientific datasets, machine learning training data, or image processing where datasets exceed RAM. It is combined with batch processing and OS tuning for best performance.
Connections
Virtual Memory in Operating Systems
np.memmap builds on OS virtual memory to map files into process address space.
Understanding OS virtual memory helps grasp how memory mapping loads data lazily and caches pages.
Out-of-Core Machine Learning
Memory mapping enables out-of-core algorithms that process data larger than RAM by loading chunks on demand.
Knowing memory mapping helps design scalable ML pipelines that handle big data efficiently.
Database Indexing
Like memory mapping, databases use indexes to access parts of data files without loading everything.
Seeing memory mapping as a low-level data access method clarifies how databases optimize large data retrieval.
Common Pitfalls
#1 Trying to reshape a memmap array to a shape that doesn't match the file size.
Wrong approach:
memmap_array = np.memmap('file.dat', dtype='float32', mode='r+', shape=(500, 500))  # file is actually 1000x1000
memmap_array = memmap_array.reshape((1000, 1000))
Correct approach:
memmap_array = np.memmap('file.dat', dtype='float32', mode='r+', shape=(1000, 1000))
Root cause: Misunderstanding that the shape must match the file's actual data layout exactly.
#2 Opening a memmap file in read-only mode but trying to write to it.
Wrong approach:
memmap_array = np.memmap('file.dat', dtype='float32', mode='r', shape=(1000, 1000))
memmap_array[0, 0] = 10  # fails: the array is read-only
Correct approach:
memmap_array = np.memmap('file.dat', dtype='float32', mode='r+', shape=(1000, 1000))
memmap_array[0, 0] = 10
Root cause: Not matching the file access mode to the intended operations.
#3 Assuming memmap loads the entire file into memory and running out of RAM.
Wrong approach: Loading a 10GB file fully into RAM with np.load('file.npy').
Correct approach: Memory-mapping it with memmap_array = np.load('file.npy', mmap_mode='r'), which respects the .npy header (raw np.memmap on a .npy file would treat the header bytes as array data).
Root cause: Not knowing the difference between np.load's eager loading and memory-mapped access for large files.
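For .npy files specifically, np.load with mmap_mode='r' returns a memmap that parses the header, so dtype and shape come from the file itself; a short sketch (the path is illustrative):

```python
import numpy as np
import tempfile, os

# Illustrative path.
path = os.path.join(tempfile.mkdtemp(), 'big.npy')
np.save(path, np.arange(12, dtype=np.float32).reshape(3, 4))

# mmap_mode='r' memory-maps the file and reads dtype/shape from the
# .npy header, instead of requiring you to specify them by hand.
mm = np.load(path, mmap_mode='r')
print(type(mm).__name__, mm.shape)  # memmap (3, 4)
```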
Key Takeaways
Memory-mapped files let you work with huge arrays on disk as if they were in memory, loading only needed parts.
np.memmap relies on the operating system's virtual memory system to manage loading and caching transparently.
You must match the data type, shape, and file access mode exactly to avoid errors or data corruption.
Memory mapping improves memory efficiency but may not always speed up access, especially for random small reads.
Understanding memory mapping opens doors to handling big data efficiently in scientific and machine learning applications.