NumPy · data · ~15 mins

Memory-mapped files with np.memmap in NumPy - Deep Dive

Overview - Memory-mapped files with np.memmap
What is it?
Memory-mapped files with np.memmap allow you to work with large arrays stored on disk as if they were in memory. Instead of loading the entire file into RAM, only the parts you need are loaded on demand. This helps you handle data too big to fit in memory at once. You can read and write these files efficiently without using much RAM.
Why it matters
Without memory-mapped files, working with very large datasets would require loading everything into memory, which can crash programs or slow them down. Memory mapping solves this by letting you access data directly on disk, saving memory and often speeding up processing. This is crucial for big data, scientific computing, or any task with huge arrays.
Where it fits
Before learning np.memmap, you should understand basic numpy arrays and file input/output. After mastering memory mapping, you can explore advanced data handling techniques like chunking, out-of-core computation, and integration with databases or distributed systems.
Mental Model
Core Idea
Memory-mapped files let you treat parts of a large file on disk as if they were in memory, loading only what you need when you need it.
Think of it like...
It's like having a huge book on a shelf and only pulling out the pages you want to read instead of carrying the whole book around.
┌─────────────────────────────┐
│ Large file on disk          │
│ ┌───────────────────────┐ │
│ │ Memory-mapped array    │ │
│ │ ┌───────────────┐     │ │
│ │ │ Loaded pages  │◄────┤ │
│ │ └───────────────┘     │ │
│ └───────────────────────┘ │
└─────────────────────────────┘

Access only needed parts, not whole file.
Build-Up - 7 Steps
1. Foundation: Understanding NumPy array basics
Concept: Learn what NumPy arrays are and how they store data in memory.
NumPy arrays are like tables of numbers stored in a single block of memory. They let you do math fast because the data is stored contiguously. For example, np.array([1, 2, 3]) creates a small array in memory.
Result
You get a fast, efficient array stored fully in RAM.
Knowing how numpy arrays live in memory helps you understand why large arrays can cause memory problems.
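The in-RAM footprint is easy to inspect directly; a quick sketch:

```python
import numpy as np

# A small array lives entirely in RAM; nbytes reports the size of its
# data buffer: 3 elements * 8 bytes (int64) = 24 bytes.
a = np.array([1, 2, 3], dtype=np.int64)
print(a.nbytes)  # 24
```

Scale this up to a billion elements and the same buffer would need 8 GB of RAM, which is exactly the problem memory mapping addresses.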
2. Foundation: File input/output with NumPy
Concept: Learn how NumPy saves and loads arrays from disk using simple functions.
You can save arrays to files with np.save('file.npy', array) and load them back with np.load('file.npy'). This reads or writes the whole array at once.
Result
You can persist arrays on disk and reload them later.
Understanding basic file I/O shows the limitation: loading big arrays fully can use too much memory.
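A minimal round trip with np.save and np.load (the file path is illustrative):

```python
import numpy as np
import tempfile, os

# Illustrative path in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'demo.npy')
arr = np.arange(6, dtype=np.float32).reshape(2, 3)

# np.save writes the whole array to disk in one go...
np.save(path, arr)

# ...and np.load reads the whole array back into RAM at once.
loaded = np.load(path)
print(np.array_equal(arr, loaded))  # True
```

Both calls move the entire array at once, which is exactly the limitation the next step removes.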
3. Intermediate: Introducing np.memmap for large arrays
🤔Before reading on: do you think np.memmap loads the entire file into memory or only parts? Commit to your answer.
Concept: np.memmap creates an array-like object that accesses data on disk without loading it all into RAM.
Using np.memmap('file.dat', dtype='float32', mode='r+', shape=(1000,1000)) creates a memory-mapped array. You can read or write parts of it, and only those parts load into memory.
Result
You get an array interface to a file that uses little RAM even for huge data.
Understanding that np.memmap loads data lazily helps you handle datasets bigger than your RAM.
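A minimal sketch of creating a file-backed array (the path and shape are illustrative; mode='w+' is used so the example creates its own file):

```python
import numpy as np
import tempfile, os

# Illustrative path; 'w+' creates the file at the full mapped size.
path = os.path.join(tempfile.mkdtemp(), 'big.dat')
mm = np.memmap(path, dtype='float32', mode='w+', shape=(1000, 1000))

# The file is 1000 * 1000 * 4 bytes = 4 MB on disk, but slicing touches
# only the pages that back the requested region.
first_row = mm[0, :10]
print(first_row.shape)  # (10,)
```

The object behaves like an ordinary ndarray, so existing NumPy code mostly works on it unchanged.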
4. Intermediate: Reading and writing with np.memmap
🤔Before reading on: do you think writing to a np.memmap array changes the file immediately or only in memory? Commit to your answer.
Concept: Changes to a np.memmap array write directly to the file on disk, not just memory.
When you assign values to parts of a memmap array, those changes go to the file-backed pages and are persisted by the OS. For example, memmap_array[0, 0] = 42 updates the file either when the OS flushes the page or when you call flush() explicitly.
Result
You can modify large files efficiently without loading them fully.
Knowing that writes affect the file directly prevents confusion about data persistence.
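A sketch of the write path, including an explicit flush() (paths are illustrative):

```python
import numpy as np
import tempfile, os

# Illustrative path.
path = os.path.join(tempfile.mkdtemp(), 'data.dat')
mm = np.memmap(path, dtype='float32', mode='w+', shape=(100, 100))

# The assignment goes to the file-backed page, not a private copy.
mm[0, 0] = 42.0

# flush() asks the OS to write dirty pages out now rather than on its
# own schedule (pages are also flushed when the memmap is deleted).
mm.flush()

# Reopening the file read-only shows the persisted value.
reopened = np.memmap(path, dtype='float32', mode='r', shape=(100, 100))
print(reopened[0, 0])  # 42.0
```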
5. Intermediate: Modes and data types in np.memmap
Concept: Learn about different file access modes and importance of matching data types.
np.memmap supports modes like 'r' (read-only), 'r+' (read-write), and 'w+' (create new). The dtype and shape must match the file's data layout exactly to avoid errors.
Result
You can safely open files with correct permissions and data formats.
Understanding modes and dtypes avoids common bugs like data corruption or crashes.
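A sketch of the mode and layout rules (the path is illustrative). Reopening a file with a dtype that implies a larger layout than the file actually holds is rejected:

```python
import numpy as np
import tempfile, os

# Illustrative path; 'w+' creates a fresh 400-byte file (10*10 float32).
path = os.path.join(tempfile.mkdtemp(), 'modes.dat')
np.memmap(path, dtype='float32', mode='w+', shape=(10, 10)).flush()

# 'r' reopens it read-only; dtype and shape must describe the same layout.
ro = np.memmap(path, dtype='float32', mode='r', shape=(10, 10))

# float64 with the same shape would need 800 bytes; mapping more bytes
# than the file contains raises ValueError.
try:
    np.memmap(path, dtype='float64', mode='r', shape=(10, 10))
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)  # True
```

Note the asymmetry: in 'r' mode a size mismatch is an error, while 'w+' (and 'r+' on a short file) will size the file to fit the requested layout.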
6. Advanced: Performance and memory usage trade-offs
🤔Before reading on: do you think memory mapping always speeds up data access? Commit to your answer.
Concept: Memory mapping can speed up access but may be slower for random small reads compared to in-memory arrays.
Memory mapping uses the operating system's virtual memory system. Sequential reads are fast, but random access may cause delays due to disk I/O. Also, OS caching affects performance.
Result
You understand when memory mapping helps and when it might slow you down.
Knowing performance trade-offs helps you choose memory mapping wisely for your workload.
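The sequential case can be sketched as a chunked pass over a file-backed array, which plays to the strength of OS read-ahead (the path and sizes are illustrative):

```python
import numpy as np
import tempfile, os

# Illustrative path and sizes.
path = os.path.join(tempfile.mkdtemp(), 'seq.dat')
n = 1_000_000

w = np.memmap(path, dtype='float32', mode='w+', shape=(n,))
w[:] = 1.0
w.flush()

# Sequential, chunked passes let the OS stream consecutive pages from
# disk; scattered single-element reads would instead pay a page fault
# and possibly a disk seek per access.
mm = np.memmap(path, dtype='float32', mode='r', shape=(n,))
chunk = 100_000
total = 0.0
for start in range(0, n, chunk):
    total += float(mm[start:start + chunk].sum())
print(total)  # 1000000.0
```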
7. Expert: Internal OS handling of memory-mapped files
🤔Before reading on: do you think np.memmap manages disk reads itself or relies on the OS? Commit to your answer.
Concept: np.memmap relies on the operating system's virtual memory to load and cache file parts transparently.
When you access a memmap array, the OS loads the needed pages from disk into RAM. It also caches pages to speed up repeated access. This means np.memmap is a thin wrapper over OS memory management.
Result
You realize np.memmap's behavior depends heavily on OS paging and caching.
Understanding OS involvement explains why memory mapping performance varies and how to tune it.
Under the Hood
np.memmap creates a numpy array interface that points to a file on disk. It uses the OS's virtual memory system to map file contents into the process's address space. When you access parts of the array, the OS loads corresponding pages from disk into RAM. Writes update the file directly. This avoids loading the entire file and leverages OS caching.
Why designed this way?
Memory mapping was designed to handle large files efficiently without exhausting RAM. Using OS virtual memory avoids reinventing complex paging logic in Python. It also allows multiple processes to share data via the same file. Alternatives like loading full files or manual chunking are less efficient or more complex.
┌───────────────┐       ┌───────────────┐
│ np.memmap obj │──────▶│ OS virtual    │
│ (numpy array) │       │ memory system │
└───────────────┘       └───────────────┘
         │                        │
         │ Accesses pages         │ Loads pages
         ▼                        ▼
┌─────────────────────┐   ┌─────────────────┐
│ File on disk        │   │ RAM cache pages │
│ (raw data bytes)    │   │ (loaded pages)  │
└─────────────────────┘   └─────────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does np.memmap load the entire file into memory immediately? Commit yes or no.
Common Belief: np.memmap loads the whole file into RAM when opened.
Reality: np.memmap only loads parts of the file on demand, not the entire file.
Why it matters: Believing this causes unnecessary fear of memory use and may prevent using memory mapping when it would help.
Quick: Does writing to a np.memmap array only change memory until you save explicitly? Commit yes or no.
Common Belief: Changes to a memmap array are only in memory until you call a save function.
Reality: Writes to a memmap array update the file on disk immediately or when the OS flushes pages.
Why it matters: Misunderstanding this can cause data loss if you think changes are temporary and don't handle files properly.
Quick: Can you change the shape of a memmap array after creation? Commit yes or no.
Common Belief: You can reshape a memmap array to any shape after loading.
Reality: The shape must match the file's data layout exactly; reshaping arbitrarily can cause errors or corrupt data.
Why it matters: Incorrect reshaping leads to crashes or wrong data interpretation.
Quick: Does memory mapping always speed up data access compared to loading in RAM? Commit yes or no.
Common Belief: Memory mapping is always faster than loading data fully into memory.
Reality: Memory mapping can be slower for random small reads due to disk access overhead.
Why it matters: Overestimating speed can lead to poor performance if memory mapping is used inappropriately.
Expert Zone
1. Memory mapping performance depends heavily on the operating system's page cache and how it manages disk I/O.
2. Multiple processes can share the same memory-mapped file, enabling efficient inter-process communication without copying data.
3. The underlying file format and alignment affect how well memory mapping works; unaligned or compressed files are not suitable.
When NOT to use
Avoid np.memmap when working with compressed or encrypted files, or when random small reads dominate and speed is critical. Instead, use chunked loading, databases, or in-memory arrays if data fits.
Production Patterns
In production, np.memmap is used for large scientific datasets, machine learning training data, or image processing where datasets exceed RAM. It is combined with batch processing and OS tuning for best performance.
Connections
Virtual Memory in Operating Systems
np.memmap builds on OS virtual memory to map files into process address space.
Understanding OS virtual memory helps grasp how memory mapping loads data lazily and caches pages.
Out-of-Core Machine Learning
Memory mapping enables out-of-core algorithms that process data larger than RAM by loading chunks on demand.
Knowing memory mapping helps design scalable ML pipelines that handle big data efficiently.
Database Indexing
Like memory mapping, databases use indexes to access parts of data files without loading everything.
Seeing memory mapping as a low-level data access method clarifies how databases optimize large data retrieval.
Common Pitfalls
#1 Trying to reshape a memmap array to a shape that doesn't match the file size.
Wrong approach:
memmap_array = np.memmap('file.dat', dtype='float32', mode='r+', shape=(500, 500))  # file is actually 1000x1000
memmap_array = memmap_array.reshape((1000, 1000))
Correct approach:
memmap_array = np.memmap('file.dat', dtype='float32', mode='r+', shape=(1000, 1000))
Root cause: Misunderstanding that the shape must match the file's actual data layout exactly.
#2 Opening a memmap file in read-only mode but trying to write to it.
Wrong approach:
memmap_array = np.memmap('file.dat', dtype='float32', mode='r', shape=(1000, 1000))
memmap_array[0, 0] = 10  # fails: the array is read-only
Correct approach:
memmap_array = np.memmap('file.dat', dtype='float32', mode='r+', shape=(1000, 1000))
memmap_array[0, 0] = 10
Root cause: Not matching the file access mode to the intended operations.
#3 Assuming memmap loads the entire file into memory and running out of RAM.
Wrong approach: Loading a 10GB file fully into RAM with np.load('file.npy').
Correct approach: Memory-mapping it with memmap_array = np.load('file.npy', mmap_mode='r'), which respects the .npy header (raw np.memmap on a .npy file would treat the header bytes as array data).
Root cause: Not knowing the difference between np.load's eager loading and memory-mapped access for large files.
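For .npy files specifically, np.load with mmap_mode='r' returns a memmap that parses the header, so dtype and shape come from the file itself; a short sketch (the path is illustrative):

```python
import numpy as np
import tempfile, os

# Illustrative path.
path = os.path.join(tempfile.mkdtemp(), 'big.npy')
np.save(path, np.arange(12, dtype=np.float32).reshape(3, 4))

# mmap_mode='r' memory-maps the file and reads dtype/shape from the
# .npy header, instead of requiring you to specify them by hand.
mm = np.load(path, mmap_mode='r')
print(type(mm).__name__, mm.shape)  # memmap (3, 4)
```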
Key Takeaways
Memory-mapped files let you work with huge arrays on disk as if they were in memory, loading only needed parts.
np.memmap relies on the operating system's virtual memory system to manage loading and caching transparently.
You must match the data type, shape, and file access mode exactly to avoid errors or data corruption.
Memory mapping improves memory efficiency but may not always speed up access, especially for random small reads.
Understanding memory mapping opens doors to handling big data efficiently in scientific and machine learning applications.