Overview - Inverted index data structure

What is it?

An inverted index is a data structure used to quickly find documents that contain specific words. It works by mapping each word to a list of documents where that word appears. This makes searching very fast, especially in large collections of text. Elasticsearch uses inverted indexes to power its full-text search capabilities.

Why it matters

Without inverted indexes, searching for words in large text collections would be very slow because the system would have to scan every document. This would make search engines and applications that rely on quick text search frustratingly slow or unusable. Inverted indexes solve this by organizing data to find matches instantly.

Where it fits

Before learning about inverted indexes, you should understand basic data structures like arrays and hash maps, and have a general idea of how search works. After this, you can learn about Elasticsearch’s query language and how inverted indexes support complex search features like ranking and filtering.

Mental Model

Core Idea

An inverted index flips the usual document-to-word relationship by listing each word first, then all documents containing it, enabling lightning-fast search.

Think of it like...

Imagine a library where instead of searching every book page by page, you have a card catalog listing each word and all the books that contain it. You just look up the word in the catalog and instantly find all relevant books.

Words → Documents
┌─────────────┐
│ apple       │ → doc1, doc3, doc7
│ banana      │ → doc2, doc4
│ cherry      │ → doc1, doc5, doc6
└─────────────┘

Build-Up - 6 Steps

1

FoundationWhat is an inverted index?

Concept: Introducing the basic idea of mapping words to documents.

Normally, documents list words they contain. An inverted index reverses this: it lists each word and all documents where it appears. This helps find documents by word quickly.

Result

You get a list of words, each linked to documents containing them.

Understanding this reversal is key to grasping how search engines find text fast.

2

FoundationHow documents and words relate

3

IntermediateBuilding the inverted index structure

4

IntermediateUsing inverted index for search queries

5

AdvancedOptimizations inside inverted indexes

6

ExpertHandling updates and scaling inverted indexes

Under the Hood

Internally, Elasticsearch breaks text into tokens and builds a dictionary mapping each token to a postings list of document IDs and positions. These postings lists are compressed and stored in segments. Search queries access these lists to quickly find matching documents without scanning all data.

Why designed this way?

The inverted index was designed to solve the slow search problem in large text collections. By reversing the document-to-word mapping, it allows direct access to documents containing a word. Alternatives like scanning all documents were too slow. Segment-based storage balances update speed and search efficiency.

┌───────────────┐
│ Text Documents│
└──────┬────────┘
       │ Tokenize & Normalize
       ▼
┌───────────────┐
│ Inverted Index│
│ Word → Docs   │
└──────┬────────┘
       │ Search Queries
       ▼
┌───────────────┐
│ Matching Docs │
└───────────────┘

Myth Busters - 4 Common Misconceptions

Quick: Does an inverted index store the full text of documents? Commit yes or no.

Common Belief:An inverted index stores the entire text of each document for searching.

Tap to reveal reality

Quick: When searching multiple words, does the system scan all documents or combine word lists? Commit your answer.

Common Belief:Searching multiple words means scanning every document to check for matches.

Tap to reveal reality

Quick: Does the inverted index store words exactly as typed in documents? Commit yes or no.

Common Belief:The index stores words exactly as they appear, including case and punctuation.

Tap to reveal reality

Quick: Are inverted indexes rebuilt from scratch on every document update? Commit yes or no.

Common Belief:Every time a document changes, the entire inverted index is rebuilt.

Tap to reveal reality

Expert Zone

1

The order of document IDs in postings lists is often sorted to enable fast intersection during multi-word queries.

2

Compression techniques like gap encoding reduce the size of postings lists, improving storage and speed.

3

Segment merging is a background process balancing write performance and search speed, and tuning it affects system behavior.

When NOT to use

Inverted indexes are not suitable for data without searchable text or for exact-match key-value lookups where hash indexes or B-trees are better. For numeric range queries, specialized indexes like BKD trees are preferred.

Production Patterns

In production, inverted indexes are combined with filters and caching to speed up repeated queries. They are sharded across nodes for horizontal scaling. Real systems tune analyzers and segment merging to balance search relevance, speed, and update latency.

Connections

Hash Map

Both map keys to values; inverted index maps words to document lists, hash maps map keys to single values.

Understanding hash maps helps grasp how inverted indexes quickly find document lists by word keys.

Library Card Catalog

The inverted index functions like a card catalog listing words and their books.

Knowing how libraries organize books by subject helps understand how inverted indexes organize documents by words.

Biology - DNA Sequencing

Both involve indexing sequences (words or DNA fragments) to quickly find matches in large datasets.

Recognizing that indexing sequences is a common pattern across fields reveals the universal power of inverted indexes.

Common Pitfalls

#1Trying to search text without an inverted index, scanning all documents.

Wrong approach:SELECT * FROM documents WHERE content LIKE '%apple%';

Correct approach:Use Elasticsearch query: { "match": { "content": "apple" } }

Root cause:Not understanding that full-text search requires specialized indexes for speed.

#2Indexing raw text without tokenization or normalization.

Wrong approach:Indexing documents as-is without analyzers in Elasticsearch.

Correct approach:Use analyzers to tokenize and normalize text before indexing.

Root cause:Ignoring text processing leads to poor search relevance and larger indexes.

#3Rebuilding the entire index on every document update.

Wrong approach:Deleting and recreating the whole index for small changes.

Correct approach:Use Elasticsearch’s incremental update and segment merging features.

Root cause:Misunderstanding how Elasticsearch manages index updates efficiently.

Key Takeaways

An inverted index reverses the usual document-to-word mapping to enable fast text search.

It stores words with lists of documents containing them, not full document text.

Text is processed before indexing to improve search quality and efficiency.

Elasticsearch manages updates using segments to keep search fast and scalable.

Understanding inverted indexes is essential for grasping how modern search engines work.