
Memory-mapped arrays for large data in NumPy - Deep Dive

Overview - Memory-mapped arrays for large data
What is it?
Memory-mapped arrays let you work with very large data files on disk as if they were ordinary in-memory arrays. Instead of loading the entire file into RAM, only the parts you need are loaded on demand. This helps when your data is too big to fit in your computer's memory all at once: you can read, write, and process large datasets efficiently without crashing your program.
Why it matters
Without memory-mapped arrays, handling huge datasets would require loading everything into memory, which can cause your computer to slow down or run out of memory. This limits the size of data you can analyze. Memory mapping solves this by letting you access data directly on disk, making it possible to work with datasets much larger than your RAM. This is crucial for big data analysis, scientific computing, and machine learning on large files.
Where it fits
Before learning memory-mapped arrays, you should understand basic numpy arrays and how data is stored in memory. After this, you can explore advanced data handling techniques like chunking, out-of-core processing, and using libraries that build on memory mapping for big data workflows.
Mental Model
Core Idea
Memory-mapped arrays treat large files on disk like normal arrays by loading only needed parts into memory on demand.
Think of it like...
It's like reading a huge book by opening only the pages you want instead of carrying the whole book around.
┌─────────────────────────────┐
│ Large data file on disk     │
│ ┌───────────────────────┐ │
│ │ Memory-mapped array   │ │
│ │ ┌───────────────┐   │ │
│ │ │ Requested data │←──┼─┤
│ │ └───────────────┘   │ │
│ └───────────────────────┘ │
└─────────────────────────────┘

Only the requested parts of the file are loaded into RAM when accessed.
Build-Up - 7 Steps
1
Foundation: Understanding numpy array basics
🤔
Concept: Learn what numpy arrays are and how they store data in memory.
Numpy arrays are grids of numbers stored in your computer's memory. They let you do math on many numbers at once, quickly. For example, np.array([1, 2, 3]) creates a small array in memory.
Result
You get a fast, efficient container for numbers that you can use for calculations.
Understanding numpy arrays is essential because memory-mapped arrays build on this concept but add disk storage.
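The basics above can be tried in a few lines; nbytes shows exactly how much RAM an array's data occupies:

```python
import numpy as np

# A small in-memory array: every element lives in RAM.
a = np.array([1, 2, 3], dtype=np.float32)

# Vectorized math touches all elements at once.
doubled = a * 2

# nbytes reports how much RAM the array's data occupies.
size = a.nbytes  # 3 elements * 4 bytes each = 12
print(doubled.tolist(), size)
```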
2
Foundation: Limits of memory with large arrays
🤔
Concept: Recognize that very large arrays may not fit into your computer's RAM.
If you try to create a huge numpy array that is bigger than your RAM, your program may crash or slow down. For example, creating an array with billions of elements can exceed memory limits.
Result
You see errors, or your computer becomes unresponsive, when an array is too large for RAM.
Knowing this problem motivates the need for memory-mapped arrays to handle big data safely.
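You can estimate the memory a large array would need before trying to allocate it; a quick sketch (the billion-element shape here is just an example):

```python
import numpy as np

# Estimate the RAM a large array would need *without* allocating it.
shape = (1_000_000_000,)                   # one billion elements (example)
itemsize = np.dtype(np.float64).itemsize   # 8 bytes per float64

required_bytes = shape[0] * itemsize
required_gib = required_bytes / 2**30
print(f"{required_gib:.2f} GiB needed")    # roughly 7.45 GiB
```

If the estimate exceeds your available RAM, that is the cue to reach for memory mapping instead of np.zeros(shape).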
3
Intermediate: What is memory mapping in numpy?
🤔Before reading on: do you think memory mapping loads the entire file into RAM or only parts? Commit to your answer.
Concept: Memory mapping links a file on disk to a numpy array without loading it all into memory.
Using numpy.memmap, you create an array object that points at a file on disk. When you access parts of this array, only those parts are loaded into RAM. For example:

import numpy as np
mmap_array = np.memmap('data.dat', dtype='float32', mode='r', shape=(1000, 1000))

This creates a 1000x1000 array backed by 'data.dat' on disk. In mode 'r' the file must already exist and be large enough to hold 1000 * 1000 float32 values (4 MB).
Result
You get an array-like object that behaves like a normal numpy array but reads data from disk on demand.
Understanding that memory mapping loads data lazily prevents confusion about memory usage and performance.
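A runnable version of the example above; since mode='r' requires an existing file, this sketch first creates the backing file in a temporary directory (the path is an assumption, not part of the original example):

```python
import os
import tempfile
import numpy as np

# Create a small backing file first (mode='w+' allocates it on disk),
# then reopen it read-only the way the text describes.
path = os.path.join(tempfile.mkdtemp(), "data.dat")

writer = np.memmap(path, dtype="float32", mode="w+", shape=(1000, 1000))
writer[:] = 0.0
writer.flush()           # push pending pages to the file
del writer               # drop the write mapping

# Read-only view: no data is loaded until a slice is accessed.
mmap_array = np.memmap(path, dtype="float32", mode="r", shape=(1000, 1000))
print(mmap_array.shape, mmap_array[0, :3])
```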
4
Intermediate: Reading and writing with memmap arrays
🤔Before reading on: do you think you can modify data in a read-only memory-mapped array? Commit to your answer.
Concept: Memory-mapped arrays can be opened in different modes to allow reading, writing, or creating new files.
Modes include 'r' for read-only, 'r+' for read-write on an existing file, and 'w+' for creating a new file (overwriting any existing one). For example:

mmap_rw = np.memmap('data.dat', dtype='float32', mode='r+', shape=(1000, 1000))
mmap_rw[0, 0] = 42.0
mmap_rw.flush()  # make sure the change reaches the file

This changes the first element on disk.
Result
You can safely read or update parts of large files without loading everything.
Knowing the modes helps avoid accidental data loss or permission errors.
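An end-to-end sketch of the modes: 'w+' creates the backing file, 'r+' reopens it for updates, and flush() pushes dirty pages back to disk (the temporary path is illustrative):

```python
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "data.dat")

# 'w+' creates the file, zero-filled at the requested size.
np.memmap(path, dtype="float32", mode="w+", shape=(1000, 1000)).flush()

# 'r+' reopens the existing file for read-write.
mmap_rw = np.memmap(path, dtype="float32", mode="r+", shape=(1000, 1000))
mmap_rw[0, 0] = 42.0     # the change targets the file, not just RAM
mmap_rw.flush()          # force the dirty page back to disk
del mmap_rw

# Reopen read-only to confirm the write persisted.
check = np.memmap(path, dtype="float32", mode="r", shape=(1000, 1000))
print(check[0, 0])
```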
5
Intermediate: Performance considerations of memmap
🤔Before reading on: do you think memory-mapped arrays are always faster than loading data fully? Commit to your answer.
Concept: Memory mapping improves memory use but may have slower access times due to disk reads.
Accessing data not yet loaded triggers a disk read, which is much slower than RAM. Sequential access is faster than random access, and SSDs narrow the gap. Example: reading a slice loads only the pages containing that slice from disk.
Result
You get efficient memory use but must design access patterns carefully for speed.
Understanding trade-offs helps optimize your code for both memory and speed.
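One way to see the laziness: slicing a memmap returns another memmap still backed by the file, while an explicit copy pulls the data into ordinary RAM. A small sketch (sizes and path are illustrative):

```python
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "data.dat")
data = np.memmap(path, dtype="float32", mode="w+", shape=(10_000, 100))
data[:] = 1.0
data.flush()

# Slicing a memmap yields another memmap view: still backed by the file,
# so pulling one row touches only the pages that row lives on.
row = data[5000]
print(type(row).__name__)        # memmap

# An explicit copy pulls the data into an ordinary in-RAM ndarray.
in_ram = np.array(data[:100])
print(type(in_ram).__name__)     # ndarray
```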
6
Advanced: Using memmap for out-of-core computation
🤔Before reading on: do you think memmap alone handles all big data processing needs? Commit to your answer.
Concept: Memory-mapped arrays enable processing data larger than RAM by loading chunks as needed.
You can write algorithms that process data in small pieces using memmap, avoiding full data loading. For example, summing rows in batches:

for i in range(0, mmap_array.shape[0], batch_size):
    batch = mmap_array[i:i + batch_size]
    process(batch)

This approach is called out-of-core computation.
Result
You can analyze datasets bigger than memory without crashes.
Knowing how to combine memmap with chunking unlocks big data processing on normal machines.
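The batch loop above can be made concrete; this sketch builds a small file and sums it out-of-core, so only one batch of rows needs to be resident at a time (sizes are illustrative):

```python
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "data.dat")
mmap_array = np.memmap(path, dtype="float32", mode="w+", shape=(10_000, 8))
mmap_array[:] = 1.0
mmap_array.flush()

# Out-of-core sum: only one batch of rows is pulled in per iteration.
batch_size = 1000
total = 0.0
for i in range(0, mmap_array.shape[0], batch_size):
    batch = mmap_array[i:i + batch_size]   # loads only these rows
    total += float(batch.sum())

print(total)   # 10_000 rows * 8 ones = 80000.0
```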
7
Expert: The operating system's role in memory mapping
🤔Before reading on: do you think numpy handles all memory mapping logic internally or relies on the operating system? Commit to your answer.
Concept: Memory mapping relies on the operating system's virtual memory system to load data pages on demand.
When you create a memmap, numpy calls OS functions to map the file into virtual memory. The OS loads pages lazily and caches them. This means numpy itself does not read the file directly but depends on OS paging. This also means behavior can vary by OS and filesystem.
Result
You understand why memmap performance depends on OS and hardware.
Knowing the OS handles paging explains why memmap can be fast and why some access patterns cause slowdowns.
Under the Hood
Memory-mapped arrays use the operating system's virtual memory feature to map a file directly into the process's address space. Instead of reading the whole file, the OS loads only the memory pages accessed by the program. When you access an element, if its page is not in RAM, the OS reads it from disk automatically. Changes to writable mappings are written back to the file. This mechanism avoids copying data and reduces memory usage.
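To see the OS facility numpy builds on, Python's standard-library mmap module exposes the same mechanism directly (mmap on POSIX, file mappings on Windows); a minimal sketch:

```python
import mmap
import os
import struct
import tempfile

# Write four float32 values as raw bytes to a file.
path = os.path.join(tempfile.mkdtemp(), "raw.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<4f", 1.0, 2.0, 3.0, 4.0))

# Map the file into this process's address space. Reading the mapped
# bytes makes the OS page them in on demand; no explicit read() call.
with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    values = struct.unpack("<4f", mapped[:16])
    mapped.close()

print(values)
```

numpy.memmap wraps exactly this kind of mapping and layers array semantics (dtype, shape, slicing) on top of the raw bytes.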
Why designed this way?
Memory mapping was designed to handle large files efficiently without loading them fully into RAM. Early computing faced memory limits, so mapping files into memory allowed programs to treat files like arrays without complex manual file I/O. Alternatives like loading full files or manual chunking were slower or more error-prone. Using OS virtual memory leverages existing efficient paging systems.
┌───────────────────────────────┐
│        Application code        │
│  ┌─────────────────────────┐  │
│  │ numpy.memmap object      │  │
│  └─────────────┬───────────┘  │
└───────────────│───────────────┘
                │
                ▼
┌───────────────────────────────┐
│ Operating System Virtual Memory│
│ ┌───────────────┐             │
│ │ Page Table    │◄────────────┤
│ └───────┬───────┘             │
│         │                     │
│         ▼                     │
│ ┌───────────────┐             │
│ │ Disk File     │             │
│ │ (data.dat)    │             │
│ └───────────────┘             │
└───────────────────────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does memory mapping load the entire file into RAM immediately? Commit to yes or no.
Common Belief:Memory-mapped arrays load the whole file into memory as soon as they are created.
Reality:Memory mapping loads only the parts of the file that are accessed, not the entire file at once.
Why it matters:Believing this causes confusion about memory usage and may lead to unnecessary data loading or inefficient code.
Quick: Can you write to a memory-mapped array opened in 'r' mode? Commit to yes or no.
Common Belief:You can modify data in any memory-mapped array regardless of mode.
Reality:Only memory-mapped arrays opened in write or read-write modes ('w+', 'r+') can be modified; 'r' mode is read-only.
Why it matters:Trying to write to a read-only memmap causes errors or data corruption risks.
Quick: Is memory mapping always faster than loading data fully into RAM? Commit to yes or no.
Common Belief:Memory-mapped arrays always improve speed compared to loading data fully.
Reality:Memory mapping saves memory but can be slower due to disk access latency, especially with random access patterns.
Why it matters:Assuming memmap is always faster can lead to poor performance if access patterns are not optimized.
Quick: Does numpy handle all memory mapping internally without OS help? Commit to yes or no.
Common Belief:Numpy fully manages memory mapping without relying on the operating system.
Reality:Numpy uses the operating system's virtual memory system to implement memory mapping.
Why it matters:Ignoring OS role can cause misunderstandings about platform differences and performance behavior.
Expert Zone
1
Memory-mapped arrays share the same underlying file, so multiple processes can access and modify data concurrently if managed carefully.
2
The OS page cache affects memmap performance; understanding how the OS caches disk pages helps optimize access patterns.
3
File system type and hardware (SSD vs HDD) significantly impact memory mapping speed and behavior.
When NOT to use
Memory-mapped arrays are not ideal when you need very fast random access to small data points repeatedly or when working with compressed or encrypted files. In such cases, loading data fully into RAM or using specialized databases or formats like HDF5 or Parquet is better.
Production Patterns
In production, memmap is used for large image processing, scientific simulations, and machine learning datasets that exceed RAM. It is combined with chunked processing, parallel computing, and caching strategies to handle big data efficiently.
Connections
Virtual Memory (Operating Systems)
Memory mapping relies on virtual memory to load file parts on demand.
Understanding OS virtual memory clarifies how memory mapping works and why it is efficient.
Out-of-Core Machine Learning
Memory-mapped arrays enable out-of-core algorithms that process data too large for RAM.
Knowing memmap helps implement scalable machine learning on big datasets.
Database Indexing
Both memory mapping and database indexing optimize access to large data without full loading.
Recognizing this connection helps appreciate different strategies for big data access.
Common Pitfalls
#1Trying to write to a memory-mapped array opened in read-only mode.
Wrong approach:
mmap = np.memmap('data.dat', dtype='float32', mode='r', shape=(1000, 1000))
mmap[0, 0] = 10.0  # raises ValueError: assignment destination is read-only
Correct approach:
mmap = np.memmap('data.dat', dtype='float32', mode='r+', shape=(1000, 1000))
mmap[0, 0] = 10.0  # writes through to the file
Root cause:Misunderstanding the mode parameter and its effect on write permissions.
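A runnable sketch of this pitfall: numpy rejects writes to a mode='r' mapping with a ValueError, and reopening in 'r+' fixes it (the temporary path is illustrative):

```python
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "data.dat")
np.memmap(path, dtype="float32", mode="w+", shape=(10, 10)).flush()

readonly = np.memmap(path, dtype="float32", mode="r", shape=(10, 10))
try:
    readonly[0, 0] = 10.0      # wrong: the mapping is read-only
    error_seen = False
except ValueError:
    error_seen = True          # numpy refuses the assignment

# Reopen in 'r+' to modify the data legitimately.
writable = np.memmap(path, dtype="float32", mode="r+", shape=(10, 10))
writable[0, 0] = 10.0
writable.flush()
```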
#2Assuming memory-mapped arrays load all data into RAM immediately.
Wrong approach:
mmap = np.memmap('data.dat', dtype='float32', mode='r', shape=(1000000, 1000))
# expecting full data in RAM now
Correct approach:
mmap = np.memmap('data.dat', dtype='float32', mode='r', shape=(1000000, 1000))
# data loads only when accessed
Root cause:Confusing memory mapping with full data loading.
#3Accessing memory-mapped array with random small reads causing slow performance.
Wrong approach:
for i in range(1000000):
    val = mmap[i, 0]  # many small random reads
Correct approach:
for i in range(0, 1000000, 1000):
    batch = mmap[i:i+1000, 0]  # one batched read per 1000 rows
Root cause:Not understanding disk access latency and OS page caching.
Key Takeaways
Memory-mapped arrays let you work with files on disk as if they were numpy arrays without loading everything into RAM.
They rely on the operating system's virtual memory to load only the parts of the file you access, saving memory.
Choosing the correct mode ('r', 'r+', 'w+') is crucial to avoid errors when reading or writing data.
Memory mapping improves memory efficiency but may slow down access if data is accessed randomly or in small pieces.
Combining memory mapping with chunked processing enables handling datasets larger than your computer's memory.