0
0
HLDsystem_design~15 mins

Database indexing in HLD - Deep Dive

Choose your learning style9 modes available
Overview - Database indexing
What is it?
Database indexing is a technique that helps databases find data quickly without scanning every row. It works like a book's index, pointing to where information is stored. Indexes store keys and pointers to data, making searches faster. Without indexes, databases would be slow and inefficient for large data.
Why it matters
Without indexing, searching data in large databases would be like reading every page of a book to find a word. This would make applications slow and frustrating. Indexing solves this by allowing quick lookups, improving user experience and system performance. It also reduces server load and costs by avoiding unnecessary work.
Where it fits
Before learning indexing, you should understand basic database concepts like tables, rows, and queries. After mastering indexing, you can explore query optimization, database normalization, and advanced storage engines. Indexing is a key step in designing scalable and efficient databases.
Mental Model
Core Idea
Indexing is like creating a fast lookup guide that points directly to data locations, avoiding full scans.
Think of it like...
Imagine a library where books are not arranged by topic but randomly. To find a book, you'd check every shelf. An index is like a catalog that tells you exactly which shelf and position the book is on, saving time.
┌───────────────┐       ┌───────────────┐
│   Data Table  │◄──────│    Index      │
│  (Rows)      │       │ (Keys + Pointers)│
└───────────────┘       └───────────────┘
        ▲                      ▲
        │                      │
   Full scan             Quick lookup
Build-Up - 7 Steps
1
FoundationWhat is a database index
🤔
Concept: Introduce the basic idea of an index as a data structure that speeds up data retrieval.
A database index is a separate data structure that stores a sorted list of keys and pointers to the actual data rows. When you search for data, the database uses the index to find the location quickly instead of scanning every row.
Result
Queries that use indexed columns run much faster because the database jumps directly to the data location.
Understanding that an index is a separate structure explains why adding indexes can speed up searches but also requires extra storage.
2
FoundationTypes of database indexes
🤔
Concept: Explain common index types and their basic differences.
The most common index type is a B-tree index, which keeps keys sorted for fast range and equality searches. Another type is a hash index, which is very fast for exact matches but not for ranges. Some databases also support bitmap indexes for columns with few unique values.
Result
Knowing index types helps choose the right index for your query patterns.
Recognizing that different index types suit different queries prevents inefficient indexing choices.
3
IntermediateHow indexes speed up queries
🤔Before reading on: do you think indexes reduce the number of rows scanned or just organize data better? Commit to your answer.
Concept: Show how indexes reduce the number of rows the database must check.
When a query uses an indexed column, the database searches the index structure (like a B-tree) to find the exact location of matching rows. This avoids scanning the entire table. For example, searching for a name in a phonebook index is faster than reading every entry.
Result
Queries become faster because the database reads fewer rows and less data.
Understanding that indexes reduce scanned rows clarifies why they improve performance so much.
4
IntermediateTradeoffs of using indexes
🤔Before reading on: do you think indexes always improve performance or can they sometimes slow things down? Commit to your answer.
Concept: Explain the costs and limitations of indexes.
Indexes speed up reads but slow down writes because the index must be updated on inserts, updates, and deletes. They also consume extra disk space. Too many indexes can hurt overall performance. Choosing which columns to index is a balance between read speed and write cost.
Result
Knowing tradeoffs helps design indexes that improve overall system performance.
Recognizing that indexes have costs prevents blindly adding them and causing slowdowns.
5
IntermediateComposite and covering indexes
🤔Before reading on: do you think an index on multiple columns works like multiple single-column indexes combined? Commit to your answer.
Concept: Introduce indexes on multiple columns and indexes that include all needed data.
A composite index stores keys made from multiple columns in order. It helps queries filtering on those columns together. A covering index includes all columns a query needs, so the database can answer the query using only the index without reading the table.
Result
Composite and covering indexes can make complex queries much faster.
Knowing these index types helps optimize queries that filter or select multiple columns.
6
AdvancedIndex internals and storage
🤔Before reading on: do you think indexes store full data rows or just keys and pointers? Commit to your answer.
Concept: Explain how indexes are stored and structured internally.
Most indexes use balanced tree structures like B-trees or B+ trees. These trees keep keys sorted and balanced for fast search, insert, and delete operations. Index nodes store keys and pointers to child nodes or data rows. Leaf nodes point directly to data locations. This structure minimizes disk reads.
Result
Understanding internals explains why indexes are fast and how they handle updates.
Knowing the tree structure helps predict index behavior under heavy load or large data.
7
ExpertAdvanced indexing strategies and surprises
🤔Before reading on: do you think all indexes improve query speed equally, or can some indexes mislead the optimizer? Commit to your answer.
Concept: Discuss advanced topics like index selectivity, optimizer interaction, and partial indexes.
Index selectivity measures how well an index filters data; low selectivity indexes may not help queries. Some databases support partial indexes that index only rows meeting a condition, saving space. The query optimizer decides whether to use an index or not, sometimes ignoring indexes if they don't help. Understanding these nuances is key for tuning.
Result
Experts can design indexes that truly improve performance and avoid wasted resources.
Knowing optimizer behavior and index selectivity prevents common indexing mistakes and surprises.
Under the Hood
Indexes are stored as balanced tree structures (like B-trees) that keep keys sorted and balanced. Each node contains keys and pointers to child nodes or data rows. Searching an index involves traversing from the root to leaf nodes, reducing disk reads. Updates modify the tree to keep it balanced. This structure allows fast lookups, inserts, and deletes.
Why designed this way?
B-trees were chosen because they minimize disk I/O by storing many keys per node, fitting disk pages. Balanced trees ensure search paths are short, keeping performance stable even as data grows. Alternatives like hash indexes are faster for exact matches but don't support range queries well, so B-trees offer a good balance.
┌───────────────┐
│    Root       │
│  (Keys + Ptr) │
└──────┬────────┘
       │
┌──────┴───────┐
│  Internal    │
│  Nodes       │
└──────┬───────┘
       │
┌──────┴───────┐
│   Leaf Nodes │
│ (Keys + Ptr) │
└──────┬───────┘
       │
┌──────┴───────┐
│  Data Rows   │
└──────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Do indexes always speed up every query? Commit to yes or no before reading on.
Common Belief:Indexes always make queries faster.
Tap to reveal reality
Reality:Indexes speed up many queries but can slow down some, especially writes or queries that don't use indexed columns.
Why it matters:Blindly adding indexes can degrade overall system performance and increase storage costs.
Quick: Does an index store the full data row? Commit to yes or no before reading on.
Common Belief:Indexes store complete copies of the data rows.
Tap to reveal reality
Reality:Indexes store only keys and pointers to data rows, not full data, except for covering indexes which include some columns.
Why it matters:Misunderstanding this leads to wrong assumptions about index size and update costs.
Quick: Is a composite index the same as multiple single-column indexes combined? Commit to yes or no before reading on.
Common Belief:A composite index is just multiple single-column indexes combined.
Tap to reveal reality
Reality:A composite index is a single index on multiple columns in a specific order, which behaves differently than separate indexes.
Why it matters:Using separate indexes instead of a composite one can cause inefficient query plans.
Quick: Can the database ignore an index even if it exists? Commit to yes or no before reading on.
Common Belief:If an index exists, the database always uses it for queries.
Tap to reveal reality
Reality:The query optimizer may ignore indexes if it estimates a full scan is cheaper.
Why it matters:Expecting indexes to always be used can lead to confusion when queries remain slow.
Expert Zone
1
Index selectivity is crucial; high selectivity indexes filter data well and improve performance, while low selectivity indexes may be ignored.
2
Partial indexes index only a subset of rows, saving space and improving performance for specific queries.
3
The order of columns in composite indexes affects which queries can use them efficiently.
When NOT to use
Avoid indexing columns with very low cardinality (few unique values) like boolean flags; use bitmap indexes or no index instead. For write-heavy tables, minimize indexes to reduce update overhead. Use full-text search engines or external search systems for complex text queries instead of relying solely on database indexes.
Production Patterns
In production, indexes are carefully chosen based on query patterns and monitored for usage. Partial and filtered indexes are used to optimize specific queries. Index maintenance tasks like rebuilding and statistics updates are scheduled regularly. Some systems use multi-column indexes and covering indexes to speed up complex queries.
Connections
Caching
Builds-on
Both caching and indexing aim to speed up data access by avoiding repeated work; understanding indexing helps grasp how caching stores frequently accessed data for quick retrieval.
File system directory structures
Similar pattern
File systems use tree structures to organize files for quick lookup, similar to how database indexes use B-trees to find data efficiently.
Library cataloging systems
Analogous system
Library catalogs organize books by author, title, or subject to help users find books quickly, just like indexes organize data keys to speed up database queries.
Common Pitfalls
#1Adding indexes on every column without analysis
Wrong approach:CREATE INDEX idx_all ON table(col1); CREATE INDEX idx_all2 ON table(col2); CREATE INDEX idx_all3 ON table(col3);
Correct approach:Analyze query patterns and add indexes only on columns frequently used in WHERE or JOIN clauses.
Root cause:Misunderstanding that more indexes always improve performance, ignoring write overhead and storage costs.
#2Creating a composite index with columns in wrong order
Wrong approach:CREATE INDEX idx_wrong ON table(col2, col1); -- Queries filter on col1 then col2
Correct approach:CREATE INDEX idx_correct ON table(col1, col2); -- Matches query filter order
Root cause:Not knowing that composite index column order affects which queries can use the index.
#3Expecting index to speed up queries with low selectivity columns
Wrong approach:CREATE INDEX idx_flag ON table(boolean_flag); -- flag has only true/false
Correct approach:Avoid indexing low cardinality columns or use bitmap indexes if supported.
Root cause:Assuming all indexes improve performance regardless of data distribution.
Key Takeaways
Database indexes are special data structures that speed up data retrieval by pointing directly to data locations.
Different index types and structures suit different query patterns and data types; choosing the right one is key.
Indexes improve read performance but add overhead to writes and storage, so balance is essential.
Understanding index internals like B-trees helps predict performance and design better indexes.
Advanced indexing strategies and optimizer behavior can surprise even experienced users, so continuous learning is important.