
Memory and storage engine basics (WiredTiger) in MongoDB - Deep Dive

Overview - Memory and storage engine basics (WiredTiger)
What is it?
WiredTiger is the default storage engine used by MongoDB to manage how data is stored and accessed on disk and in memory. It controls how data is written, read, and cached to provide fast and reliable database operations. WiredTiger uses a combination of in-memory caching and on-disk storage to balance speed and durability.
Why it matters
Without a storage engine like WiredTiger, MongoDB would not efficiently handle large amounts of data or provide fast access to it. The storage engine solves the problem of managing data safely and quickly, even when many users access the database at the same time. Without it, databases would be slow, unreliable, and prone to data loss.
Where it fits
Before learning about WiredTiger, you should understand basic database concepts like collections and documents in MongoDB. After this, you can explore advanced topics like indexing, replication, and performance tuning that build on how WiredTiger manages data.
Mental Model
Core Idea
WiredTiger acts like a smart librarian who organizes, caches, and safely stores books (data) so readers (queries) get fast and reliable access.
Think of it like...
Imagine a library where the librarian keeps popular books on a special shelf nearby (memory cache) for quick access, while less-used books stay in the main stacks (disk). The librarian also carefully tracks changes to books to avoid losing any information.
┌───────────────────────────────────┐
│         WiredTiger Engine         │
├─────────────────┬─────────────────┤
│  Memory         │  Disk Storage   │
│  Cache          │  Filesystem     │
│  (Hot Data)     │  (Persistent)   │
├─────────────────┴─────────────────┤
│  Transaction Logs & Checkpoints   │
└───────────────────────────────────┘
Build-Up - 7 Steps
1. Foundation: What is a Storage Engine?
Concept: Introduction to the role of a storage engine in a database.
A storage engine is the part of a database that handles how data is saved on disk and how it is retrieved. It decides how to organize data files, manage memory, and ensure data is safe even if the system crashes.
Result
You understand that a storage engine is essential for storing and accessing data efficiently and safely.
Knowing what a storage engine does helps you see why databases need specialized components beyond just storing data.
2. Foundation: Memory and Disk Basics
Concept: Understanding the difference between memory (RAM) and disk storage.
Memory is fast but temporary storage used to hold data while the database is running. Disk storage is slower but permanent, keeping data safe even when the computer is off. A good storage engine balances using both for speed and durability.
Result
You grasp why databases use memory for quick access and disk for long-term storage.
Recognizing the trade-off between speed and permanence is key to understanding how WiredTiger works.
3. Intermediate: WiredTiger’s Cache Management
Before reading on: do you think WiredTiger keeps all data in memory or only some? Commit to your answer.
Concept: WiredTiger uses a cache in memory to store frequently accessed data for faster reads and writes.
WiredTiger reserves a portion of system memory as its internal cache. It keeps 'hot' (frequently accessed) data there so queries can read it without going to disk every time. When data changes, WiredTiger updates the cached copy first and writes the modified data to the on-disk data files later, at the next checkpoint.
Result
You see how caching improves performance by reducing slow disk reads.
Understanding caching explains why some queries are fast and how WiredTiger balances memory use with disk storage.
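To make the caching idea concrete, here is a minimal Python sketch of a read-through cache with least-recently-used eviction. It is a toy model under stated assumptions, not WiredTiger's actual eviction algorithm: the class name `ToyCache`, the `disk` dictionary standing in for data files, and the capacity of 2 are all hypothetical and chosen only for illustration.

```python
from collections import OrderedDict


class ToyCache:
    """Toy read-through cache with LRU eviction (illustrative only)."""

    def __init__(self, disk, capacity):
        self.disk = disk              # stands in for on-disk data files
        self.capacity = capacity      # maximum number of cached entries
        self.cache = OrderedDict()    # key -> value, least recently used first
        self.disk_reads = 0           # counts slow-path reads

    def get(self, key):
        if key in self.cache:
            # Cache hit: mark as most recently used, no disk access.
            self.cache.move_to_end(key)
            return self.cache[key]
        # Cache miss: read from "disk", then keep the value in memory.
        self.disk_reads += 1
        value = self.disk[key]
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            # Evict the least recently used entry to stay within capacity.
            self.cache.popitem(last=False)
        return value


disk = {"a": 1, "b": 2, "c": 3}
cache = ToyCache(disk, capacity=2)
for key in ["a", "a", "b", "a", "c", "b"]:
    cache.get(key)
print(cache.disk_reads)
```

Six lookups cause only four disk reads here because repeated access to "a" is served from memory. With a cache too small for the working set, most lookups would fall through to disk.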
4. Intermediate: Transactions and Data Safety
Before reading on: do you think WiredTiger writes data to disk immediately or waits? Commit to your answer.
Concept: WiredTiger uses transactions and journaling to keep data safe and consistent.
When data changes, WiredTiger groups the changes into transactions and records them in an on-disk journal (a write-ahead log) before the modified data reaches the data files. If the system crashes, WiredTiger recovers by starting from the last checkpoint and replaying the journal entries written after it, which avoids data loss or corruption.
Result
You understand how WiredTiger ensures data is not lost even during crashes.
Knowing about transactions and journaling reveals how WiredTiger balances speed with reliability.
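The write-ahead idea can be sketched in a few lines of Python. This is a simplified illustration, not WiredTiger's journal format: the JSON-lines file, the `set`/`delete` record shape, and the function names `write` and `recover` are assumptions made for this example.

```python
import json
import os
import tempfile


def apply(state, record):
    """Apply one journal record to an in-memory key-value state."""
    if record["op"] == "set":
        state[record["key"]] = record["value"]
    elif record["op"] == "delete":
        state.pop(record["key"], None)


def write(journal_path, state, record):
    """Log the change durably first, then apply it in memory."""
    with open(journal_path, "a", encoding="utf-8") as journal:
        journal.write(json.dumps(record) + "\n")
        journal.flush()
        os.fsync(journal.fileno())   # force the record to stable storage
    apply(state, record)


def recover(journal_path):
    """Rebuild state after a crash by replaying the journal in order."""
    state = {}
    with open(journal_path, encoding="utf-8") as journal:
        for line in journal:
            apply(state, json.loads(line))
    return state


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "journal.log")
    live = {}
    write(path, live, {"op": "set", "key": "x", "value": 1})
    write(path, live, {"op": "set", "key": "y", "value": 2})
    write(path, live, {"op": "delete", "key": "x"})
    live = None                      # simulate losing memory in a crash
    recovered = recover(path)
    print(recovered)
```

The key design choice is the ordering inside `write`: the record is on disk before the in-memory state changes, so anything a client saw as applied can be reconstructed by replay.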
5. Intermediate: Compression in WiredTiger
Concept: WiredTiger compresses data to save disk space and improve performance.
WiredTiger can compress data before writing it to disk. This reduces the amount of storage used and can speed up reads and writes because less data moves between disk and memory, at the cost of some CPU time for compressing and decompressing. Compression is enabled by default (snappy for collection data) and is configurable.
Result
You learn how compression helps WiredTiger use resources efficiently.
Recognizing compression’s role shows how WiredTiger optimizes storage without sacrificing speed.
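The space-versus-CPU trade-off can be observed with Python's standard `zlib` module. This sketch only illustrates block compression in general; it does not call WiredTiger, and the sample documents and compression levels are assumptions chosen for the demonstration (snappy is not in the Python standard library, so zlib stands in here).

```python
import json
import zlib

# Repetitive JSON documents, similar to many records sharing field names.
docs = [{"user_id": i, "status": "active", "country": "US"} for i in range(1000)]
raw = json.dumps(docs).encode("utf-8")

fast = zlib.compress(raw, 1)    # level 1: less CPU work, usually larger output
small = zlib.compress(raw, 9)   # level 9: more CPU work, usually smaller output

print(len(raw), len(fast), len(small))

# Decompression must return exactly the original bytes.
assert zlib.decompress(small) == raw
```

Data with repeated field names and values, which is typical of document collections, tends to compress well, which is why compression often pays for its CPU cost.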
6. Advanced: Checkpointing and Recovery
Before reading on: do you think WiredTiger writes all changes to disk immediately or periodically? Commit to your answer.
Concept: WiredTiger periodically saves a consistent snapshot of data to disk called a checkpoint.
Instead of writing every change to the data files immediately, WiredTiger creates checkpoints at intervals (every 60 seconds by default in MongoDB). A checkpoint is a consistent image of the database on disk. If a crash happens, WiredTiger restores data quickly and safely from the last checkpoint plus the journal entries written since then.
Result
You understand how checkpoints improve recovery speed and reduce disk write overhead.
Knowing checkpointing helps you appreciate WiredTiger’s design for balancing performance and durability.
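The interplay of checkpoints and the journal can be modeled with a small Python class. This is a conceptual sketch only: `ToyEngine` and its methods are hypothetical names, the "disk" is just a copied dictionary, and real checkpoints are far more involved.

```python
import copy


class ToyEngine:
    """Toy model of checkpoints plus a journal (illustrative only)."""

    def __init__(self):
        self.memory = {}        # current in-memory state
        self.checkpoint = {}    # last consistent snapshot "on disk"
        self.journal = []       # changes logged since the last checkpoint

    def set(self, key, value):
        self.journal.append((key, value))   # log first
        self.memory[key] = value            # then update memory

    def take_checkpoint(self):
        # Persist a consistent snapshot. Journal entries older than the
        # checkpoint are no longer needed for recovery and are discarded.
        self.checkpoint = copy.deepcopy(self.memory)
        self.journal.clear()

    def crash_and_recover(self):
        # Memory is lost: start from the checkpoint and replay the journal.
        self.memory = copy.deepcopy(self.checkpoint)
        replayed = 0
        for key, value in self.journal:
            self.memory[key] = value
            replayed += 1
        return replayed


engine = ToyEngine()
engine.set("a", 1)
engine.set("b", 2)
engine.take_checkpoint()
engine.set("b", 20)
engine.set("c", 3)
replayed = engine.crash_and_recover()
print(engine.memory, replayed)
```

Only the two changes made after the checkpoint need to be replayed. This is the trade-off behind checkpoint frequency: more frequent checkpoints mean less to replay after a crash, but more write work during normal operation.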
7. Expert: Concurrency Control with MVCC
Before reading on: do you think WiredTiger locks the whole database for writes or allows many operations at once? Commit to your answer.
Concept: WiredTiger uses Multi-Version Concurrency Control (MVCC) to allow many reads and writes simultaneously without conflicts.
MVCC means WiredTiger keeps multiple versions of data so readers can access a stable snapshot while writers make changes. This avoids locking the entire database and improves performance in multi-user environments. It also helps maintain data consistency.
Result
You see how WiredTiger supports high concurrency and fast operations.
Understanding MVCC reveals how WiredTiger handles complex workloads efficiently without blocking users.
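A toy multi-version store shows how a reader can keep a stable snapshot while a writer proceeds. This is a minimal sketch assuming a single logical clock and no conflict detection; `ToyMVCC` and its methods are hypothetical names, not WiredTiger APIs.

```python
class ToyMVCC:
    """Toy multi-version store with snapshot reads (illustrative only)."""

    def __init__(self):
        self.versions = {}   # key -> list of (commit_ts, value), ascending
        self.clock = 0       # logical commit timestamp

    def write(self, key, value):
        # Each write appends a new version instead of overwriting.
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

    def snapshot(self):
        # A reader remembers the timestamp at which it started.
        return self.clock

    def read(self, key, snapshot_ts):
        # Return the newest version committed at or before the snapshot.
        result = None
        for commit_ts, value in self.versions.get(key, []):
            if commit_ts <= snapshot_ts:
                result = value
        return result

    def prune(self, oldest_active_ts):
        # Drop versions that no active snapshot can still see.
        for key, chain in self.versions.items():
            visible = [v for v in chain if v[0] <= oldest_active_ts]
            newer = [v for v in chain if v[0] > oldest_active_ts]
            self.versions[key] = visible[-1:] + newer


store = ToyMVCC()
store.write("doc", "v1")
reader_ts = store.snapshot()       # reader starts here
store.write("doc", "v2")           # writer proceeds without blocking

old_view = store.read("doc", reader_ts)
new_view = store.read("doc", store.snapshot())
store.prune(store.snapshot())      # reader finished; old versions are dropped
remaining = len(store.versions["doc"])
print(old_view, new_view, remaining)
```

The reader that started before the second write still sees "v1", while a new reader sees "v2". Once no snapshot needs the old version, `prune` removes it, so versions do not accumulate forever.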
Under the Hood
WiredTiger manages data by keeping a cache in memory for fast access and writing changes to disk files in a structured format. It uses a write-ahead log (journal) to record changes before applying them, ensuring durability. The engine employs MVCC to handle multiple versions of data, allowing concurrent reads and writes without locking conflicts. Periodic checkpoints create stable snapshots on disk for quick recovery.
Why designed this way?
WiredTiger was designed to improve on older storage engines by providing better concurrency, compression, and crash recovery. The use of MVCC and caching was chosen to maximize performance in modern multi-core systems. Write-ahead logging and checkpoints balance durability with speed, avoiding the cost of writing every change immediately. Alternatives like simple locking or no compression were rejected because they limit scalability and efficiency.
┌───────────────┐       ┌───────────────┐
│    Client     │──────▶│  WiredTiger   │
│    Queries    │       │ Storage Engine│
└───────────────┘       └───────┬───────┘
                                │
               ┌────────────────┴───────────────┐
               │                                │
        ┌──────▼───────┐                ┌───────▼───────┐
        │ Memory Cache │                │  Disk Storage │
        │ (Hot Data)   │                │ (Data Files & │
        │              │                │  Journals)    │
        └──────────────┘                └───────────────┘
               │                                ▲
               └─────────────┬──────────────────┘
                             │
                      ┌──────▼───────┐
                      │ Checkpoints  │
                      │ & Recovery   │
                      └──────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does WiredTiger write every data change immediately to disk? Commit yes or no.
Common Belief: WiredTiger writes every change to disk immediately to ensure data safety.
Reality: WiredTiger uses caching and checkpoints, so not every change is written to the data files immediately; it writes changes to a journal first and flushes data periodically.
Why it matters: Believing immediate writes happen can lead to misunderstanding performance behavior and tuning, causing inefficient configurations.
Quick: Does WiredTiger lock the entire database for writes? Commit yes or no.
Common Belief: WiredTiger locks the whole database during writes to prevent conflicts.
Reality: WiredTiger uses MVCC to allow concurrent reads and writes without locking the entire database.
Why it matters: Assuming full locks can cause developers to avoid concurrency optimizations and misunderstand performance bottlenecks.
Quick: Is compression in WiredTiger optional or mandatory? Commit your answer.
Common Belief: Compression is mandatory and always enabled in WiredTiger.
Reality: Compression is enabled by default (snappy for collection data) but is configurable; users can choose a different compressor or disable compression based on their needs.
Why it matters: Misunderstanding compression can lead to unexpected storage use or performance issues if it is not configured properly.
Quick: Does WiredTiger store all data versions forever? Commit yes or no.
Common Belief: WiredTiger keeps all versions of data indefinitely for safety.
Reality: WiredTiger keeps multiple versions temporarily for concurrency but cleans up old versions to save space.
Why it matters: Thinking versions accumulate forever can cause concerns about storage growth and misinterpretation of system behavior.
Expert Zone
1. WiredTiger’s cache size is configurable and directly impacts performance and memory usage; tuning it requires balancing workload and system resources.
2. The choice of compression algorithm (snappy, zlib, zstd) affects CPU usage and storage savings, influencing overall system throughput.
3. Checkpoint frequency affects recovery time and write performance; too frequent checkpoints increase overhead, too infrequent increase recovery time.
When NOT to use
WiredTiger may not be ideal for workloads that require extremely low-latency writes with minimal CPU overhead. In such cases, an alternative such as the In-Memory Storage Engine (available in MongoDB Enterprise) or a specialized data store might be a better fit.
Production Patterns
In production, WiredTiger is often tuned with custom cache sizes and compression settings based on the workload. It is typically combined with replica sets for high availability, and cache usage and checkpoint behavior are monitored so that settings can be adjusted as the workload changes.
Connections
Operating System Virtual Memory
WiredTiger’s internal cache works alongside the operating system’s memory management: MongoDB uses both the WiredTiger cache and the OS filesystem cache.
Understanding OS memory management helps you grasp how WiredTiger balances memory use between its own cache and the rest of the system.
Version Control Systems
WiredTiger’s MVCC is similar to how version control systems keep multiple versions of files to allow safe concurrent edits.
Knowing version control concepts clarifies how WiredTiger manages multiple data versions for concurrency.
Library Book Lending
Like a librarian managing popular and archived books, WiredTiger manages hot data in cache and cold data on disk.
This cross-domain connection shows how organizing resources for quick access and safe storage is a universal challenge.
Common Pitfalls
#1 Setting WiredTiger cache size too large, leaving little memory for the OS.
Wrong approach: storage.wiredTiger.engineConfig.cacheSizeGB: 30 # On a 32GB RAM server
Correct approach: storage.wiredTiger.engineConfig.cacheSizeGB: 20 # Leaves memory for OS and other processes
Root cause: Misunderstanding that WiredTiger cache shares system memory, causing OS to swap and degrade performance.
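The arithmetic behind this pitfall can be checked with a short Python sketch. It assumes MongoDB's documented default for the WiredTiger internal cache, the larger of 50% of (RAM - 1 GB) or 256 MB; the function names `default_cache_gb` and `headroom_gb` are hypothetical helpers, not MongoDB APIs.

```python
def default_cache_gb(ram_gb):
    """Documented default: larger of 50% of (RAM - 1 GB) or 256 MB."""
    return max(0.5 * (ram_gb - 1), 0.25)


def headroom_gb(ram_gb, cache_gb):
    """Memory left for the OS, filesystem cache, and other processes."""
    return ram_gb - cache_gb


ram = 32
default = default_cache_gb(ram)
for cache in (30, 20, default):
    print(f"cache={cache} GB -> headroom={headroom_gb(ram, cache)} GB")
```

On a 32 GB server, a 30 GB cache leaves only 2 GB for everything else, while the default would be 15.5 GB. Any value above the default deserves a deliberate check of what else runs on the machine.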
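#2 Disabling journaling to improve write speed without understanding risks.
Wrong approach: storage.journal.enabled: false
Correct approach: storage.journal.enabled: true # Journaling is always on from MongoDB 6.1, where this option was removed
Root cause: Ignoring that journaling protects data integrity; disabling it removes crash recovery for writes made since the last checkpoint.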
#3 Assuming compression always improves performance and enabling it blindly.
Wrong approach: storage.wiredTiger.collectionConfig.blockCompressor: zlib # On CPU-limited system
Correct approach: storage.wiredTiger.collectionConfig.blockCompressor: snappy # Balanced CPU and compression
Root cause: Not considering CPU cost of compression algorithms leading to CPU bottlenecks.
Key Takeaways
WiredTiger is the engine that manages how MongoDB stores and accesses data efficiently using memory and disk.
It uses caching, transactions, journaling, and checkpoints to balance speed, safety, and durability.
Multi-Version Concurrency Control allows many users to read and write data at the same time without conflicts.
Compression and checkpointing optimize storage and recovery but require tuning based on workload.
Understanding WiredTiger’s internals helps in configuring MongoDB for better performance and reliability.