Overview - Why documents are the unit of data

What is it?

In Elasticsearch, data is stored and managed as documents. A document is the basic unit of information, expressed as a JSON object made up of fields. Each document represents a single entity or record, such as a user profile or a product listing. Because each record is self-contained, it can be stored, indexed, searched, and retrieved as one unit.

Why it matters

Using documents as the unit of data allows Elasticsearch to handle complex and varied information flexibly. In a strictly tabular model, every record must fit a fixed set of columns, and nested or varying data has to be spread across several tables, which makes it harder to index and search as a whole. Documents keep related data together, which supports fast full-text search and straightforward scaling, both crucial for applications like search engines and analytics platforms.

Where it fits

Before learning why documents are the unit of data, you should understand basic data storage concepts and JSON format. After this, you can explore how Elasticsearch indexes documents and performs searches, and then learn about mapping and querying documents in detail.

Mental Model

Core Idea

A document is a self-contained package of data that Elasticsearch stores, indexes, and searches as a single unit.

Think of it like...

Think of a document like a single page in a filing cabinet, where each page holds all the information about one person or item. Instead of searching through the whole cabinet, you quickly find the right page with the details you need.

┌───────────────┐
│   Document    │
│ ┌───────────┐ │
│ │ Field 1   │ │
│ │ Field 2   │ │
│ │ Field 3   │ │
│ └───────────┘ │
└───────────────┘
Each document contains multiple fields with data.

Build-Up - 6 Steps

1

Foundation - Understanding the Document Concept

Concept: Introduce what a document is in Elasticsearch and why it is the basic data unit.

A document in Elasticsearch is a JSON object that holds data about one entity. For example, a document could represent a book with fields like title, author, and year. Documents are stored in indexes and can be searched individually.
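As a minimal sketch of the book example above, the snippet below builds one document as a Python dictionary and serializes it to JSON. The field values, the index name `books`, and the ID `1` are illustrative assumptions, not taken from a real dataset.

```python
import json

# Hypothetical book record; the field names follow the example in the text.
book = {
    "title": "The Hobbit",
    "author": "J. R. R. Tolkien",
    "year": 1937,
}

# Elasticsearch receives a document as a JSON object in the request body,
# for example the body of PUT /books/_doc/1 (index name and ID are assumptions).
body = json.dumps(book)
print(body)

# Parsing the JSON gives back the same self-contained record.
restored = json.loads(body)
```

Everything that describes the book travels together in one object, which is what lets Elasticsearch treat it as a single unit.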

Result

You understand that documents are the smallest pieces of data Elasticsearch works with.

Knowing that documents are self-contained helps you see why Elasticsearch can quickly find and retrieve data without scanning unrelated information.

2

Foundation - JSON Format as Document Structure

3

Intermediate - Documents vs. Traditional Rows

4

Intermediate - Indexing Documents for Fast Search

5

Advanced - Document Immutability and Updates

6

Expert - Document Storage and Sharding Internals

Under the Hood

Elasticsearch keeps the original JSON of each document in the _source field and indexes its fields separately. When a document is added, text fields are analyzed into tokens, and each resulting term is recorded in an inverted index that maps the term to the documents containing it. Stored documents are never modified in place; an update writes a new version and marks the old one as deleted. An index is split into shards, each hosted on a node, which enables distributed storage and search.
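The term-to-document mapping described above can be sketched in a few lines of Python. This is a toy model under stated assumptions: `tokenize` is a crude stand-in for a real analyzer, `search` requires every query term to be present, and none of the function names come from Elasticsearch itself.

```python
from collections import defaultdict
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a stand-in for an analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs, field):
    """Map each term found in `field` to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for term in tokenize(doc.get(field, "")):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the IDs of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical documents keyed by ID.
docs = {
    1: {"title": "Quick brown fox"},
    2: {"title": "Lazy brown dog"},
    3: {"title": "Quick start guide"},
}
title_index = build_inverted_index(docs, "title")
print(sorted(search(title_index, "quick")))        # [1, 3]
print(sorted(search(title_index, "brown quick")))  # [1]
```

A lookup touches only the entries for the query terms, so the cost depends on how many documents contain those terms rather than on the total number of documents.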

Why designed this way?

Documents as units allow flexible, schema-less data storage suited for varied real-world data. Immutability simplifies concurrency and indexing. Sharding supports horizontal scaling and fault tolerance. Alternatives like fixed tables or mutable records would limit flexibility and scalability.

┌───────────────┐       ┌───────────────┐       ┌───────────────┐
│   Document 1  │──────▶│   Shard 1     │──────▶│   Node A      │
│   Document 2  │──────▶│   Shard 2     │──────▶│   Node B      │
│   Document 3  │──────▶│   Shard 3     │──────▶│   Node C      │
└───────────────┘       └───────────────┘       └───────────────┘
Each document is indexed and stored in one primary shard, and the shards are distributed across nodes.
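One way to picture how a document lands on a shard is a hash of its routing value (by default the document ID) taken modulo the number of primary shards. The sketch below is a simplified form of that idea: `pick_shard` and the shard-to-node table are hypothetical, and `zlib.crc32` is only a stand-in for the hash function Elasticsearch uses internally.

```python
import zlib

def pick_shard(routing, number_of_primary_shards):
    """Pick a shard number from a routing value (by default the document ID).

    zlib.crc32 is only a stand-in hash for this sketch; Elasticsearch uses
    its own hash function internally.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards

# Hypothetical layout: three primary shards, one hosted on each node.
shard_to_node = {0: "Node A", 1: "Node B", 2: "Node C"}

for doc_id in ["1", "2", "3", "42"]:
    shard = pick_shard(doc_id, len(shard_to_node))
    print(f"document {doc_id} -> shard {shard} -> {shard_to_node[shard]}")
```

Because the result depends only on the routing value and the shard count, the same document ID always resolves to the same shard, which is why a lookup by ID does not need to ask every node.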

Myth Busters - 4 Common Misconceptions

Quick: Do you think all documents in an index must have the same fields? Commit to yes or no.

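Common Belief: All documents in an Elasticsearch index must have the same fields and structure.

Reality: Documents in the same index can have different sets of fields. The mapping fixes the type of each field, not which fields a given document must contain.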

Quick: Do you think updating a document changes it instantly in place? Commit to yes or no.

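Common Belief: When you update a document, Elasticsearch modifies it directly in storage.

Reality: Stored documents are never modified in place. An update writes a new version of the document and marks the old one as deleted.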

Quick: Do you think Elasticsearch searches all documents linearly? Commit to yes or no.

Common Belief: Elasticsearch searches by scanning every document one by one.

Reality: Searches look terms up in an inverted index, which points directly to the matching documents.

Quick: Do you think documents are stored on a single server only? Commit to yes or no.

Common Belief: All documents in an index are stored on one server or node.

Reality: An index is split into shards, and the shards are distributed across the nodes of the cluster.

Expert Zone

1

Documents can have nested objects and arrays, but querying nested fields requires special handling to avoid incorrect matches.

2

Elasticsearch merges segments of indexed documents in the background to optimize search speed and storage, which affects how quickly deleted documents free space.

3

Mapping conflicts can occur if documents with the same field name have different data types, requiring careful index design.
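The first point above, about incorrect matches on arrays of objects, can be sketched as follows. By default an array of objects is flattened into one list of values per field, so the link between values of the same inner object is lost. The functions and the `user` data below are hypothetical and only model that behavior; they are not Elasticsearch APIs.

```python
def flatten_object_array(field, objects):
    """Flatten a list of objects into one list of values per dotted field name."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat

def matches_flattened(flat, conditions):
    """Each condition is checked independently against the flattened values."""
    return all(value in flat.get(name, []) for name, value in conditions.items())

def matches_nested(field, objects, conditions):
    """All conditions must hold within the same inner object."""
    prefix = field + "."
    return any(
        all(obj.get(name[len(prefix):]) == value for name, value in conditions.items())
        for obj in objects
    )

# Hypothetical document with an array of objects.
users = [
    {"first": "John", "last": "Smith"},
    {"first": "Alice", "last": "White"},
]
query = {"user.first": "Alice", "user.last": "Smith"}

flat = flatten_object_array("user", users)
print(flat)
print(matches_flattened(flat, query))        # True: a cross-object false match
print(matches_nested("user", users, query))  # False: no single user is Alice Smith
```

The second check corresponds to what the nested field type and nested queries provide: each inner object is matched on its own.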

When NOT to use

Using documents as the unit is less suitable when strict relational integrity or complex multi-table joins are needed; traditional relational databases with normalized tables are better in those cases.

Production Patterns

In production, documents are designed to be denormalized, containing all needed data to avoid joins. Index templates and mappings enforce field types. Shard count and replication are tuned for performance and reliability. Bulk APIs are used for efficient document ingestion.
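Bulk ingestion sends many documents in one request as newline-delimited JSON: an action line followed by the document source, repeated for each document. The sketch below only builds that request body; `build_bulk_body`, the index name `books`, and the sample documents are assumptions for illustration, and sending the body to the _bulk endpoint is left out.

```python
import json

def build_bulk_body(index_name, docs):
    """Build an NDJSON body for the _bulk API: one action line, then one source line."""
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"

# Hypothetical denormalized documents: each one carries everything it needs.
docs = {
    "1": {"title": "The Hobbit", "author": "J. R. R. Tolkien", "year": 1937},
    "2": {"title": "Dune", "author": "Frank Herbert", "year": 1965},
}
body = build_bulk_body("books", docs)
print(body, end="")
```

Each document stays a complete, self-contained JSON object on its own line, so the bulk format does not change what a document is, only how many are sent per request.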

Connections

JSON Data Format

Documents are structured as JSON objects, building directly on JSON syntax and semantics.

Understanding JSON helps grasp how documents store complex, nested data flexibly.

Inverted Index

Documents are indexed using inverted indexes to enable fast full-text search.

Knowing inverted indexes explains why documents can be searched quickly despite large data volumes.

Library Cataloging Systems

Like documents in Elasticsearch, library cards represent individual books with metadata for quick lookup.

Seeing documents as catalog cards helps understand how indexing and retrieval work in search systems.

Common Pitfalls

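#1 Treating an update as an in-place change to individual fields.

Wrong approach: POST /index/_doc/1 { "title": "New Title" } // expecting the other fields to be kept; this replaces the entire document, so every field left out of the body is lost

Correct approach: POST /index/_update/1 { "doc": { "title": "New Title" } } // partial update; or resend the full document with POST /index/_doc/1 { "title": "New Title", "other_fields": "..." }

Root cause: Not realizing that stored documents are never modified in place. A partial update reads the current document, merges the change, and reindexes the result, so every update, partial or full, writes a new version of the whole document.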
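The update behavior behind this pitfall can be modeled with a toy store in which every write saves a whole new copy of the document. `TinyStore` is a hypothetical class written for this sketch, its merge is a shallow dictionary merge, and its version counter only illustrates that each write produces a new version.

```python
class TinyStore:
    """Toy model of update semantics: every write stores a whole new document."""

    def __init__(self):
        self.live = {}     # doc_id -> (version, source)
        self.deleted = []  # old copies marked as deleted, awaiting cleanup

    def index(self, doc_id, source):
        """Store a full document, replacing any previous copy."""
        version = 1
        if doc_id in self.live:
            old_version, old_source = self.live[doc_id]
            self.deleted.append((doc_id, old_version, old_source))
            version = old_version + 1
        self.live[doc_id] = (version, dict(source))
        return version

    def update(self, doc_id, partial):
        """Partial update: read the current source, merge, then index the result."""
        _, current = self.live[doc_id]
        merged = {**current, **partial}  # shallow merge, enough for this sketch
        return self.index(doc_id, merged)

store = TinyStore()
store.index("1", {"title": "Old Title", "author": "A. Writer"})
v2 = store.update("1", {"title": "New Title"})  # author is kept
after_update = store.live["1"][1]
v3 = store.index("1", {"title": "Newer Title"})  # author is dropped
after_replace = store.live["1"][1]
print(v2, after_update)
print(v3, after_replace)
```

Both paths end in `index`, which is the point: a partial update is a convenience for the caller, not a cheaper kind of write.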

#2 Assuming all documents must have identical fields, or assuming new fields are always accepted and searchable.

Wrong approach: Indexing documents with new fields without checking the index's dynamic mapping setting. With dynamic set to strict the document is rejected, and with dynamic set to false the new fields are kept in _source but not indexed.

Correct approach: Leave dynamic mapping enabled (the default) so new fields are mapped automatically, or add the fields to the mapping explicitly before indexing varied documents.

Root cause: Confusing Elasticsearch's flexible schema with fixed relational schemas.
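How an index treats a field it has not seen before depends on the mapping's dynamic setting. The sketch below models three of its values (true, false, strict) in plain Python. `apply_dynamic` is a hypothetical helper, and using Python type names as field types is a crude stand-in for real type detection.

```python
def apply_dynamic(mapping, doc, dynamic="true"):
    """Return the fields that would be indexed, adding new fields to `mapping` if allowed."""
    indexed = {}
    for name, value in doc.items():
        if name in mapping:
            indexed[name] = value
        elif dynamic == "true":
            # New field: add it to the mapping and index it.
            mapping[name] = type(value).__name__  # crude stand-in for type detection
            indexed[name] = value
        elif dynamic == "false":
            # New field stays in the stored source but is not indexed or searchable.
            continue
        elif dynamic == "strict":
            raise ValueError(f"unknown field [{name}] rejected by strict mapping")
        else:
            raise ValueError(f"unsupported dynamic mode: {dynamic}")
    return indexed

doc = {"title": "Dune", "year": 1965}

open_mapping = {"title": "str"}
print(apply_dynamic(open_mapping, doc, "true"), open_mapping)

closed_mapping = {"title": "str"}
print(apply_dynamic(closed_mapping, doc, "false"), closed_mapping)
```

With strict, the same call raises an error instead of returning, which mirrors a rejected indexing request.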

#3 Writing searches that bypass the inverted index and force a scan of every document.

Wrong approach: Using script queries that evaluate every document for a simple keyword search.

Correct approach: Use full-text queries, such as match, that look terms up in the inverted index.

Root cause: Not taking advantage of the index structures Elasticsearch builds when documents are indexed.

Key Takeaways

Documents are the basic units of data in Elasticsearch, storing all information about one entity in JSON format.

Using documents allows flexible, schema-less data storage that can handle varied and nested data easily.

Documents are indexed for fast search using inverted indexes, enabling quick retrieval without scanning all data.

Elasticsearch treats documents as immutable, replacing them on updates to simplify concurrency and indexing.

Documents are distributed across shards and nodes, allowing Elasticsearch to scale and remain reliable.